I tried auto-generating E2E tests with Playwright Test Agents

I ran the planner, generator, and healer of Playwright Test Agents as sub-agents of Claude Code. This is a record of trying out the whole flow, from E2E test planning, generation, and automatic repair through to the HTML report, with JPetStore as the target application.

Fujikawa

2026.06.11


Introduction

I'm Fujikawa from the Data Business Division. How much E2E testing do you write?

Until now, setting up E2E tests has been painstaking work: you open each screen, inspect selectors, write test code, run it, watch it fail, and fix the selectors. UIs change frequently, and every time the application changes you have to repair the tests it broke, so the cost of maintaining E2E test code grows quickly.

Generative AI can now write that test code for us. In this article, we actually stand up an application server and use Playwright Test Agents to generate E2E test code automatically, to see firsthand what it is like to get "tests that work from the start."

Overview

Playwright Test Agents is a Playwright mechanism that lets AI agents create E2E tests. It consists of three agent definitions with different roles: planner creates test plans, generator generates test code, and healer repairs failed tests. All three build tests while actually operating a browser.

Agent	Role
`planner`	Explores the app in a real browser and creates a test plan (Markdown)
`generator`	Generates test code while executing each scenario in a real browser
`healer`	Runs and debugs failed tests, identifies causes, and fixes the code

For the application server, we'll use JPetStore 6, a Java-based pet store e-commerce site. We'll handle runtime setup with mise and Node package management with pnpm (as a policy, we avoid npx), and go all the way through to viewing the results in the HTML reporter.

Preparation

Verification Environment

macOS (Apple Silicon)
Playwright 1.60.0
Claude Code
mise 2026.4.8

Getting JPetStore and Building the Runtime

First, clone the JPetStore repository.

git clone https://github.com/mybatis/jpetstore-6.git
cd jpetstore-6

Next, create mise.toml in the repository root to install Java, Node, and pnpm together.
JPetStore itself targets Java 17, but the parent POM (mybatis-parent) enforcer plugin requires JDK 21 or higher at build time, so builds will fail with temurin-17. Specifying temurin-21 resolves the issue.

[tools]
java = "temurin-21"
node = "lts"
pnpm = "latest"

Run the following commands, and if each tool's version is displayed, you're good to go.

mise trust && mise install
mise exec -- java -version   # Temurin 21
mise exec -- node -v         # v24 series (LTS)
mise exec -- pnpm -v

Starting the Application Server

Build the WAR and start it on Tomcat 9. The cargo plugin's tomcat9 profile is enabled by default, so the -P option is not needed.

mise exec -- ./mvnw clean package -DskipTests
mise exec -- ./mvnw cargo:run

Open http://localhost:8080/jpetstore/ in your browser, and if the JPetStore top page appears, preparation is complete.
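Tomcat needs a little time to come up after cargo:run, so anything scripted to run right after it can hit a connection error. As a rough sketch of how you might wait for the top page instead of checking by hand: the helper names waitForServer and httpProbe, the 30 attempts, and the 1-second delay are all illustrative assumptions, not part of JPetStore or Playwright.

```typescript
// Hypothetical helper: poll a URL until it answers or we run out of attempts.
type Probe = (url: string) => Promise<boolean>;

// Default probe built on the global fetch (available in recent Node versions).
// Any network error, such as "connection refused" while Tomcat is still
// starting, is treated as "not ready yet".
const httpProbe: Probe = async (url) => {
  try {
    const res = await fetch(url);
    return res.ok;
  } catch {
    return false;
  }
};

async function waitForServer(
  url: string,
  probe: Probe = httpProbe,
  attempts = 30,
  delayMs = 1000,
): Promise<boolean> {
  for (let i = 0; i < attempts; i++) {
    if (await probe(url)) return true;
    // Not ready: pause before the next attempt (skip the pause after the last one).
    if (i < attempts - 1) {
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    }
  }
  return false;
}
```

Calling waitForServer('http://localhost:8080/jpetstore/') resolves to true once the page responds with a success status, and to false if it never does within the allotted attempts. The probe is injectable so the loop can be exercised without a running server.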

Playwright Setup

Install @playwright/test and Chromium with pnpm.

mise exec -- pnpm add -D @playwright/test@latest
mise exec -- pnpm exec playwright install chromium

In playwright.config.ts, configure HTML reporter and JPetStore's baseURL.

import { defineConfig, devices } from '@playwright/test';

export default defineConfig({
  testDir: './tests',
  fullyParallel: true,
  reporter: [['list'], ['html', { open: 'never' }]],
  use: {
    baseURL: 'http://localhost:8080/jpetstore/',
    trace: 'retain-on-failure',
  },
  projects: [{ name: 'chromium', use: { ...devices['Desktop Chrome'] } }],
});
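One detail worth knowing about this baseURL: because it contains the /jpetstore/ context path, the form of the path you pass to page.goto() matters. Playwright builds the final URL with the standard URL constructor, so the sketch below reproduces that resolution outside Playwright. resolveAgainstBase is a hypothetical helper name, and actions/Catalog.action is used only as an example path.

```typescript
const baseURL = 'http://localhost:8080/jpetstore/';

// Resolve a path the same way the URL constructor does: new URL(path, base).
function resolveAgainstBase(path: string, base: string = baseURL): string {
  return new URL(path, base).href;
}

// A relative path stays under the context path.
console.log(resolveAgainstBase('./')); // http://localhost:8080/jpetstore/
console.log(resolveAgainstBase('actions/Catalog.action')); // http://localhost:8080/jpetstore/actions/Catalog.action

// A leading slash resets to the server root and drops /jpetstore/.
console.log(resolveAgainstBase('/')); // http://localhost:8080/
```

In other words, with this configuration page.goto('./') opens the JPetStore top page, while page.goto('/') would land on the Tomcat root instead.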

Initializing Test Agents and MCP Configuration

Initialize Test Agents.

Running the playwright init-agents command generates the agent definitions (under .claude/agents/) and the MCP server settings (.mcp.json) that let the agents operate the browser and the test runner. The --loop option selects which AI agent environment to connect to. Specifying claude makes each of the Test Agents run as a Claude Code sub-agent.

mise exec -- pnpm exec playwright init-agents --loop=claude

The following files are generated.

.
├── .claude/
│   └── agents/
│       ├── playwright-test-planner.md    # planner agent definition
│       ├── playwright-test-generator.md  # generator agent definition
│       └── playwright-test-healer.md     # healer agent definition
├── .mcp.json                             # MCP server configuration
├── specs/
│   └── README.md                         # Location for test plans
└── tests/
    └── seed.spec.ts                      # Seed test used as a starting point by agents

The generated .mcp.json is configured to start the MCP server with npx. Since we have a policy of not using npx this time, we rewrote it to directly specify the locally installed binary.

{
  "mcpServers": {
    "playwright-test": {
      "command": "/path/to/jpetstore-6/node_modules/.bin/playwright",
      "args": ["run-test-mcp-server"]
    }
  }
}
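The /path/to/ part above is a placeholder for wherever the repository was cloned. If you would rather derive the absolute path than type it, something like the following could produce the same structure. This is only a sketch: buildMcpConfig is a hypothetical helper, and it assumes POSIX-style paths as on the macOS environment used here.

```typescript
interface McpServerEntry {
  command: string;
  args: string[];
}

interface McpConfig {
  mcpServers: Record<string, McpServerEntry>;
}

// Build the .mcp.json content so that it points at the locally installed
// playwright binary instead of going through npx.
function buildMcpConfig(projectRoot: string): McpConfig {
  const root = projectRoot.replace(/\/+$/, ''); // drop any trailing slash
  return {
    mcpServers: {
      'playwright-test': {
        command: `${root}/node_modules/.bin/playwright`,
        args: ['run-test-mcp-server'],
      },
    },
  };
}

console.log(JSON.stringify(buildMcpConfig('/path/to/jpetstore-6'), null, 2));
```

Passing the real project directory (for example the value of process.cwd() when run from the repository root) and saving the printed JSON as .mcp.json gives the same file as the hand-edited version.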

When you restart Claude Code, it will ask for approval of .mcp.json. Once approved, it connects to the playwright-test MCP server and the agents become available.

Trying It Out

Creating a Test Plan with planner

Run the planner agent in Claude Code. Typing @planner brings up the following candidate; select it to confirm.

@.claude/agents/playwright-test-planner.md

We asked for a test plan covering JPetStore's main flows (catalog browsing, search, sign-in, cart, order, new registration). The planner actually launched Chromium, explored JPetStore thoroughly, and wrote a test plan with 6 suites and 8 scenarios to specs/jpetstore-core-flows.md.

Navigate the catalog from home → category → product → item detail
Keyword search (with results)
Keyword search (no results)
Sign in / Sign out
Sign-in failure with incorrect password
Add to cart → change quantity → delete
End-to-end from sign-in to order confirmation
New account registration

The plan contains specific details confirmed on the actual screens, including link text, button labels, and expected values (prices and message wording), so it is detailed enough to serve as a manual test procedure as-is.
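To see how "6 suites and 8 scenarios" fits together, the plan can be modeled as a small data structure. Note that the grouping of scenarios into suites below is my own guess based on the six flows requested; the actual layout of the generated Markdown file is not reproduced here.

```typescript
interface Suite {
  name: string;
  scenarios: string[];
}

// Assumed grouping: one suite per requested flow.
const plan: Suite[] = [
  { name: 'Catalog browsing', scenarios: ['Navigate the catalog from home to item detail'] },
  { name: 'Search', scenarios: ['Keyword search (with results)', 'Keyword search (no results)'] },
  { name: 'Sign-in', scenarios: ['Sign in / Sign out', 'Sign-in failure with incorrect password'] },
  { name: 'Cart', scenarios: ['Add to cart, change quantity, delete'] },
  { name: 'Order', scenarios: ['End-to-end from sign-in to order confirmation'] },
  { name: 'New registration', scenarios: ['New account registration'] },
];

// Total number of scenarios across all suites.
function countScenarios(suites: Suite[]): number {
  return suites.reduce((total, suite) => total + suite.scenarios.length, 0);
}

console.log(plan.length, countScenarios(plan)); // 6 8
```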

Generating Test Code with generator

Normally, once the planner has finished, the generator step runs as well.
The generator agent is launched once per scenario to generate the test code. Because the agents share a single MCP browser, they must run sequentially, one at a time, not in parallel. Each run takes roughly 2 to 15 minutes, since the agent writes the code while replaying the planned steps in a real browser.
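Since the runs cannot overlap, their durations simply add up. A back-of-the-envelope calculation for the 8 scenarios in this plan looks like the following; estimateSequentialMinutes is a hypothetical helper, and the result is only the range implied by the 2 to 15 minute figure, not a measured total.

```typescript
interface MinutesRange {
  min: number;
  max: number;
}

// With strictly sequential runs, the total is scenarios x per-scenario time.
function estimateSequentialMinutes(scenarios: number, perScenario: MinutesRange): MinutesRange {
  return {
    min: scenarios * perScenario.min,
    max: scenarios * perScenario.max,
  };
}

const total = estimateSequentialMinutes(8, { min: 2, max: 15 });
console.log(`${total.min} to ${total.max} minutes`); // 16 to 120 minutes
```

So generating the whole suite can plausibly take anywhere from a quarter of an hour to a couple of hours.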

As an example of a generated test, here is an excerpt of the new account registration test.

test.describe('Account Registration', () => {
  test('Register a new account with a unique username', async ({ page }) => {
    const username = `testuser${Date.now()}`;

    // 2. Click the 'Register Now!' link
    await page.locator('a[href*=\'newAccountForm\']').click();

    // 3. In the 'User ID:' field, enter a unique username based on the current timestamp
    await page.locator('input[name="username"]').fill(username);
    await expect(page.locator('input[name="username"]')).toHaveValue(username);
    // (omitted)
  });
});

There are two notable points.

JPetStore's in-memory DB rejects duplicate usernames. Appending a timestamp makes the username unique, so the test can be re-run any number of times.
The Stripes framework used by JPetStore dynamically generates IDs for form elements. The generator examined the actual page and chose the more resilient input[name="..."] locators on its own. That judgment is possible only because it writes the code while checking a real browser.
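The timestamp approach is enough for a single registration test. If more tests that create accounts were added and run in parallel (the config enables fullyParallel), two of them could in principle call Date.now() in the same millisecond. One possible hardening, shown purely as a sketch, is to append a random suffix. uniqueUsername is a hypothetical helper, and I have not checked whether JPetStore limits the length of the user ID field.

```typescript
// Build a username from a prefix, a timestamp, and a random suffix.
// The timestamp and suffix are parameters so the function is easy to test.
function uniqueUsername(
  prefix: string = 'testuser',
  now: number = Date.now(),
  suffix: number = Math.floor(Math.random() * 1000),
): string {
  return `${prefix}${now}${suffix}`;
}

// Same millisecond, different suffixes: the names still differ.
console.log(uniqueUsername('testuser', 1700000000000, 7) !== uniqueUsername('testuser', 1700000000000, 8)); // true
```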

Running the Tests

Let's run the 8 generated tests (plus the seed test).

mise exec -- pnpm exec playwright test

The result was 8 passed / 1 failed. The catalog browsing test failed with the following error.

Error: expect(locator).toBeVisible() failed

Locator:  getByRole('link', { name: 'Fish' })
Expected: visible
Received: hidden
  - locator resolved to <area alt="Fish" shape="RECT" ... />

The cause was that the locator resolved to the <area alt="Fish"> element of the image map in the center of the catalog page. Per the HTML specification, <area> elements have no layout box of their own, and Playwright considers an element visible only when it has a non-empty bounding box, so it always treats them as hidden.
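For reference, Playwright's documentation describes an element as visible when it has a non-empty bounding box and does not have a visibility:hidden computed style. The function below is a deliberately simplified model of that rule, not Playwright's actual implementation; the Box type and the isConsideredVisible name are made up for illustration.

```typescript
interface Box {
  width: number;
  height: number;
}

// Simplified model: visible means "has a box with non-zero size"
// and "computed visibility is not hidden".
function isConsideredVisible(
  box: Box | null,
  visibility: 'visible' | 'hidden' = 'visible',
): boolean {
  if (box === null) return false; // no box at all, as with an <area> element
  return box.width > 0 && box.height > 0 && visibility !== 'hidden';
}

console.log(isConsideredVisible(null)); // false
console.log(isConsideredVisible({ width: 80, height: 20 })); // true
console.log(isConsideredVisible({ width: 80, height: 20 }, 'hidden')); // false
```

Under this model, an assertion like toBeVisible() can never pass for an element without a box, no matter how long it waits, which matches the failure seen here.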

Repairing Tests with healer

This is where the healer agent comes in. Typing @healer brings up the following candidate.

@.claude/agents/playwright-test-healer.md

When you pass the failing test, healer re-runs it to reproduce the failure, investigates the page state in debug mode, and fixes the locator to an actual anchor element in the sidebar.

// Before fix: resolves to <area> and treated as hidden
await expect(page.getByRole('link', { name: 'Fish' })).toBeVisible();

// After fix: directly specifies an actual anchor element
await expect(page.locator('#QuickLinks a[href*="categoryId=FISH"]')).toBeVisible();

After the fix, the full suite passed 9/9. All tests also passed from a fresh state after restarting the server, so the result is reproducible.

  9 passed (5.2s)

Displaying the HTML Report

Finally, open the report produced by the HTML reporter.

mise exec -- pnpm exec playwright show-report

http://localhost:9323 opens in the browser, where you can check execution results per suite, time taken for each step, and traces on failure.

Conclusion

As you can see, we completed the full cycle of plan → generate → execute → repair → report without writing a single line of test code by hand. Trying it out also surfaced several takeaways.

generator writes code while verifying one step at a time in a real browser. healer autonomously identifies the cause and sees the fix through to completion.
Generation takes 2 to 15 minutes per test, and the agents assume sequential execution. That is faster than a human writing from scratch, but it is not something to regenerate on every CI run. Token consumption by the AI agents also translates directly into cost. From both the time and the cost perspective, the sensible split is "generate during development, run in CI."
The generated tests lock in the "current behavior" as the ground truth. Whether those assertions are correct with respect to the specification is still a human judgment, so reviewing the test code remains indispensable. I would like to cover how to make that review more efficient on another occasion.
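To make the "generate during development, run in CI" split a little more concrete, here is a sketch of how playwright.config.ts might be extended for CI runs. forbidOnly, retries, and webServer are standard Playwright Test options, but the specific values (2 retries, a 180-second startup timeout) are assumptions, as is the premise that the WAR has already been built and that Java is on the PATH when the command runs. None of this was verified in this walkthrough.

```typescript
import { defineConfig, devices } from '@playwright/test';

// Most CI services set the CI environment variable.
const isCI = !!process.env.CI;

export default defineConfig({
  testDir: './tests',
  fullyParallel: true,
  forbidOnly: isCI,        // fail the CI run if a stray test.only was committed
  retries: isCI ? 2 : 0,   // assumed value: retry flaky tests only in CI
  reporter: [['list'], ['html', { open: 'never' }]],
  use: {
    baseURL: 'http://localhost:8080/jpetstore/',
    trace: 'retain-on-failure',
  },
  // Let Playwright start JPetStore and wait until the URL responds.
  webServer: {
    command: './mvnw cargo:run',
    url: 'http://localhost:8080/jpetstore/',
    reuseExistingServer: !isCI, // locally, reuse a server that is already running
    timeout: 180_000,           // assumed: allow up to 3 minutes for startup
  },
  projects: [{ name: 'chromium', use: { ...devices['Desktop Chrome'] } }],
});
```

With a setup along these lines, CI only executes the checked-in tests against a freshly started server, and the agents stay a development-time tool.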

Until now, many projects have probably gone without E2E tests because of the effort involved. Consider using Playwright Test Agents to improve quality through E2E testing. I hope this is helpful to someone.