Testing without testing yourself
The agent owns the test suite. You don't write tests, you don't run tests, you don't review screenshots that pixel-matched. The carve-out: subjective things only — does this email look right?
This is genuinely a different model from how testing usually works. It's only possible because the agent does the work. If you're coming from a world where the founder reads every PR diff and the QA contractor walks the staging site at every breakpoint, set that down. We don't live in that world anymore.
The model — agent-owned testing
You are not an engineer. You shouldn't be writing test cases, you shouldn't be reading them, and you shouldn't be the loop that catches regressions. The agent is the engineer. It ships tests as it ships code, runs the suite on every change, reads visual regression diffs, fixes what's broken, and only escalates to you when the question is genuinely subjective — the kind of question where the right answer lives in your taste, not in a spec.
This is not "the agent helps with testing." This is "testing is the agent's job, end to end." The artifacts — specs, baselines, axe results, route asserts — exist for the agent's benefit, not yours. You will never look at a .spec.ts file in this workflow. The signal you get from testing is binary: red, the agent is working on it; green, ship.
The reason this works in 2026 and didn't in 2022 is that capable coding agents now do the bulk-tedious parts of testing — writing exhaustive unit cases, stabilizing flaky specs, triaging visual diffs — at a fraction of the wall-clock cost a human paid. The constraint that used to make full-coverage testing uneconomic was human attention, and that constraint is gone for the parts a machine can do.
The three layers
Testing splits into three layers, each with a different job. Together they catch the things humans don't.
Unit (Vitest). Isolated logic, no I/O, milliseconds per case. Pure functions, reducers, form validators, date math, money math, parsers. The job: prove the small pieces are correct so you don't waste integration runs debugging a bad helper.
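To make the unit layer concrete, here is a minimal sketch of the kind of case that lives there. The `splitTotalCents` helper is hypothetical, invented for illustration; the point is pure logic, no I/O, milliseconds per case.

```ts
// tests/unit/money.spec.ts — unit layer: pure logic, no I/O.
// `splitTotalCents` is a hypothetical helper used only for illustration.
import { describe, expect, it } from 'vitest';

// Split an amount in cents across n payers without losing a cent to rounding.
function splitTotalCents(totalCents: number, payers: number): number[] {
  const base = Math.floor(totalCents / payers);
  const remainder = totalCents % payers;
  return Array.from({ length: payers }, (_, i) => base + (i < remainder ? 1 : 0));
}

describe('splitTotalCents', () => {
  it('never loses a cent to rounding', () => {
    const shares = splitTotalCents(1000, 3); // $10.00 across 3 payers
    expect(shares).toEqual([334, 333, 333]);
    expect(shares.reduce((a, b) => a + b, 0)).toBe(1000);
  });
});
```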
Integration (Vitest). Real database, real external SDKs (against test accounts, sandbox keys, or local equivalents like a Postgres in Docker), seconds per case. The job: prove that "the code I wrote actually talks to the systems I think it talks to." This layer is where ORM schema drift, missing indexes, broken webhook signatures, and "it works locally with mocks" all surface.
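A sketch of what an integration case can look like, under two stated assumptions: a Dockerized test Postgres reachable via a `TEST_DATABASE_URL` environment variable, and a `users` table with an `email` column. Both are illustrative, not prescribed by the rig.

```ts
// tests/integration/users.spec.ts — integration layer: real Postgres, no mocks.
// TEST_DATABASE_URL and the users/email schema are assumptions for illustration.
import { afterAll, expect, it } from 'vitest';
import { Pool } from 'pg';

const db = new Pool({ connectionString: process.env.TEST_DATABASE_URL });

afterAll(() => db.end());

it('inserting a user round-trips through the real schema', async () => {
  const email = `it-${Date.now()}@example.test`;
  await db.query('INSERT INTO users (email) VALUES ($1)', [email]);
  const { rows } = await db.query('SELECT email FROM users WHERE email = $1', [email]);
  expect(rows).toHaveLength(1); // fails loudly if the table or column drifted
});
```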
End-to-end + visual regression + accessibility (Playwright + axe-core + toHaveScreenshot). Real browser, real rendered pages, minutes per full sweep. The job is everything that humans used to do by hand: "does the page render," "does the button work," "does it look right at every breakpoint and theme," "is it usable with a screen reader." Playwright drives the browser, axe-core asserts no WCAG 2.1 AA violations, toHaveScreenshot does pixel-diffing against committed baselines.
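A minimal sketch of a single e2e spec exercising all three concerns, using the `@axe-core/playwright` builder. The route (relative to a configured baseURL) and the snapshot filename are assumptions for illustration.

```ts
// tests/e2e/landing.spec.ts — e2e layer: real browser, a11y scan, pixel diff.
import { test, expect } from '@playwright/test';
import AxeBuilder from '@axe-core/playwright';

test('landing page renders, passes axe, matches baseline', async ({ page }) => {
  await page.goto('/'); // assumes baseURL is set in playwright.config.ts

  // Accessibility: no WCAG 2.1 A/AA violations anywhere on the page.
  const a11y = await new AxeBuilder({ page })
    .withTags(['wcag2a', 'wcag2aa', 'wcag21a', 'wcag21aa'])
    .analyze();
  expect(a11y.violations).toEqual([]);

  // Visual regression: pixel-compare against the committed baseline.
  await expect(page).toHaveScreenshot('landing.png', { fullPage: true });
});
```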
The three layers report independently. A red unit test is a code bug. A red integration test is an integration bug. A red visual regression is either a code bug or an intentional change that needs a baseline update — the agent decides which.
Sub-skill 02 scaffolds the rig before any feature ships
The test rig exists before the product does. When sub-skill 02 (Design) wraps, the project already has:
- A working Vitest config with separate `unit` and `integration` projects.
- A working Playwright config with five viewport breakpoints committed: `mobile-sm` (375), `mobile-lg` (390), `tablet` (768), `desktop` (1280), `wide` (1920). A config sketch follows this list.
- axe-core wired into the e2e config so every Playwright page automatically gets an a11y assertion.
- A baseline screenshot directory at `tests/e2e/visual.spec.ts-snapshots/` with the landing page captured in both light and dark themes, committed to git.
- npm scripts: `test:unit`, `test:integration`, `test:e2e`, `test:visual`, `test:visual:update`, `test:a11y`, `test:crawl`.
- Three template tests (one per layer) that pass on a clean checkout. These exist so future tests have a working pattern to copy and so the rig itself is verified end-to-end before any feature is built on top of it.
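A sketch of how those breakpoints might be expressed as Playwright projects. The widths come from the list above; the heights are illustrative assumptions, not part of the rig.

```ts
// playwright.config.ts — the five committed breakpoints as Playwright projects.
// Heights are illustrative assumptions; only the widths are specified by the rig.
import { defineConfig, devices } from '@playwright/test';

const breakpoints = [
  { name: 'mobile-sm', width: 375, height: 812 },
  { name: 'mobile-lg', width: 390, height: 844 },
  { name: 'tablet', width: 768, height: 1024 },
  { name: 'desktop', width: 1280, height: 800 },
  { name: 'wide', width: 1920, height: 1080 },
];

export default defineConfig({
  testDir: 'tests/e2e',
  projects: breakpoints.map(({ name, width, height }) => ({
    name,
    use: { ...devices['Desktop Chrome'], viewport: { width, height } },
  })),
});
```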
From sub-skill 03 onward, every sub-skill that ships code also ships tests. There is no "we'll add tests later" phase. There is no "we'll add tests before launch" phase. The rig is hot from day two of the project, and shipping a feature without its tests is a workflow violation that the agent's operating rules explicitly forbid (see SKILL.md, Tests are agent-owned).
The pixel-diff workflow — why re-runs are cheap
Visual regression with Playwright's toHaveScreenshot does the heavy lifting for "does this look right." Every page, at every breakpoint, in both themes, gets pixel-compared against a committed baseline. There are exactly three outcomes per run:
- All matched. The agent does not inspect the screenshots. The pixel-match is the contract. There is nothing to look at.
- Some mismatched. The agent reads each diff PNG, judges intentional vs. regression vs. flaky, and takes one of three actions: update the baseline (intentional change), fix the code (regression), or stabilize the spec (flaky — disable an animation, mask a timestamp, wait for a specific selector instead of a timeout).
- Mass mismatched (>30% of screenshots). The agent investigates the root cause first. A bulk update at this scale almost always means something genuinely broke (a global token changed, a font failed to load, a layout primitive regressed). Reflexively running `test:visual:update` here is how real regressions get baked into the baseline. The operating rule is: never bulk-update without an explanation.
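The stabilization moves named in the mismatch case above map directly onto `toHaveScreenshot` options. A sketch, with a hypothetical route, heading, and timestamp selector:

```ts
// Stabilizing a flaky visual spec instead of loosening the diff threshold.
// The /dashboard route and the data-testid selector are assumptions.
import { test, expect } from '@playwright/test';

test('dashboard matches baseline', async ({ page }) => {
  await page.goto('/dashboard');
  await page.getByRole('heading', { name: 'Dashboard' }).waitFor(); // wait for content, not a timeout

  await expect(page).toHaveScreenshot('dashboard.png', {
    animations: 'disabled',                            // freeze CSS animations and transitions
    mask: [page.locator('[data-testid="timestamp"]')], // hide the one thing that legitimately changes
  });
});
```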
The whole reason re-running the suite is cheap is that matched screenshots aren't re-inspected. The cost of "is the homepage at 1280px in dark mode still correct" drops to a pixel comparison; the agent only spends real attention on the diffs.
Honest trade: visual regression is sensitive to font anti-aliasing differences across operating systems. False positives happen — a baseline captured in CI on Linux will mismatch slightly when re-rendered on macOS. The fix is to stabilize the spec (pin the OS in CI, lock the font load, mask the affected region) rather than loosen the pixel-diff threshold. A loose threshold catches nothing.
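One way to hold that line is to keep the tolerance near zero in the config and make the rendering environment deterministic instead: generate and compare baselines in the same pinned CI environment (for example, the official Playwright Linux container) and lock the font load before taking the shot. A sketch, with an illustrative tolerance value:

```ts
// playwright.config.ts (excerpt) — keep the pixel-diff strict; fix the environment, not the threshold.
// Baselines are assumed to be generated in the same pinned Linux CI image they are compared in.
// In specs, `await page.evaluate(() => document.fonts.ready)` before the shot locks the font load.
import { defineConfig } from '@playwright/test';

export default defineConfig({
  expect: {
    toHaveScreenshot: {
      maxDiffPixelRatio: 0.001, // near-zero: catches real changes, absorbs nothing structural
    },
  },
});
```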
The carve-out — what the user actually validates
Almost nothing. The agent owns functional correctness. You only weigh in on subjective questions, of which there are roughly three:
- Email rendering. The agent's dev-debug helper writes every outbound email template to `tmp/emails/` as both HTML and a rendered PNG (a sketch of that helper follows this list). Once per project, you're asked whether to also send a real one to your inbox so you can confirm it survives Gmail's renderer. After that, the agent inspects the PNGs.
- Brand vibe pass. Open the deployed site. Answer "does this feel on-brand?" The agent cannot answer this for you because the answer lives in your taste.
- One-paragraph copy review. For any user-facing prose the agent wrote without explicit input from you — 404 page, empty states, marketing sub-headlines — the agent asks you to skim a paragraph. You either approve or rewrite.
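A sketch of what the email-snapshot helper mentioned above could look like, assuming Playwright is used to render the HTML to a PNG. The helper name and the template function in the usage comment are hypothetical.

```ts
// scripts/dev-email-preview.ts — sketch: render each outbound template to
// HTML and a PNG under tmp/emails/ so the agent can inspect them.
import { chromium } from 'playwright';
import { mkdir, writeFile } from 'node:fs/promises';

async function snapshotEmail(name: string, html: string) {
  await mkdir('tmp/emails', { recursive: true });
  await writeFile(`tmp/emails/${name}.html`, html);

  const browser = await chromium.launch();
  const page = await browser.newPage({ viewport: { width: 600, height: 800 } });
  await page.setContent(html, { waitUntil: 'networkidle' });
  await page.screenshot({ path: `tmp/emails/${name}.png`, fullPage: true });
  await browser.close();
}

// Usage (renderWelcomeEmail is a hypothetical template function):
// await snapshotEmail('welcome', renderWelcomeEmail({ name: 'Ada' }));
```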
That's it. If you find yourself being asked to review a Playwright spec, click through a checkout flow, or open a screenshot diff, something has gone wrong with the workflow. Push back.
Race-condition + timing tests
These are the bugs that don't show up in single-user manual testing and do show up the day you have ten paying customers. They get checked at design time and tested explicitly. The patterns the agent always covers:
- Double-submit prevention. Idempotency keys on the server, submit-locked buttons in the UI. Test asserts that two rapid clicks produce one row.
- Out-of-order webhook delivery. Compare `event.created` timestamps; ignore arrivals older than the last-applied event. Test simulates a Stripe webhook arriving out of order and asserts the older one is dropped.
- Concurrent edits. Optimistic locking via `updated_at`, surfaced as a 409 with a real conflict UI. Test loads the same record in two contexts and asserts the second save shows the conflict.
- Slow-response half-state. The loading state is visible the entire time the request is in flight, the button is disabled, and the success state only appears after the actual completion. Test throttles the network and asserts no half-rendered state is reachable.
- Optimistic UI rollback. When the optimistic update fails, the UI rolls back and surfaces the failure. Test forces a server error and asserts the UI returns to its pre-action state.
These tests live in tests/e2e/race.spec.ts as a discoverable class. New race conditions get added there so the next agent reading the file knows where they belong.
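As one example of the pattern, the out-of-order webhook guard reduces to a small pure function plus a case asserting the stale event is dropped. The record shape and the `applyEvent` name are assumptions for illustration.

```ts
// Out-of-order webhook delivery as a unit-testable guard.
import { expect, it } from 'vitest';

interface SubscriptionRecord { status: string; lastEventCreated: number }
interface StripeLikeEvent { created: number; status: string }

// Ignore any event older than the one already applied.
function applyEvent(record: SubscriptionRecord, event: StripeLikeEvent): SubscriptionRecord {
  if (event.created <= record.lastEventCreated) return record;
  return { status: event.status, lastEventCreated: event.created };
}

it('drops a webhook that arrives after a newer one', () => {
  const afterNewer = applyEvent({ status: 'active', lastEventCreated: 0 }, { created: 200, status: 'canceled' });
  const afterStale = applyEvent(afterNewer, { created: 100, status: 'active' }); // late arrival
  expect(afterStale.status).toBe('canceled'); // the stale event did not regress the status
});
```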
User-profile-aware persona tests
Generic e2e tests catch generic bugs. They do not catch the bugs your specific users will hit, because your users are not generic. The agent reads the audience definition in PROJECT.md and writes 2-4 personas that reflect how those users actually behave.
Examples we've shipped: a hands-busy cook who can only tap (no keyboard, no hover); a multi-tab planner who has the same workflow open in three tabs and expects state to stay coherent; an invited B2B user whose first session starts on an invitation link and whose role is not "owner"; an audit-mode user who expects every destructive action to be logged and undoable.
Each persona's distinguishing characteristic — the thing that makes them not the median user — drives at least one test. Files live at tests/e2e/personas/<persona-slug>.spec.ts. When PROJECT.md's audience changes, the persona suite changes with it.
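A sketch of one persona spec, the hands-busy cook. The route, control names, and step copy are assumptions; the point is that the persona's constraint (tap only, no keyboard, no hover) is enforced by the test itself.

```ts
// tests/e2e/personas/hands-busy-cook.spec.ts — persona: taps only, no keyboard, no hover.
import { test, expect } from '@playwright/test';

test.use({ hasTouch: true, viewport: { width: 390, height: 844 } });

test('cook can mark a step done with taps alone', async ({ page }) => {
  await page.goto('/recipes/tonight'); // illustrative route
  await page.getByRole('button', { name: 'Start cooking' }).tap();
  await page.getByRole('checkbox', { name: 'Step 1 done' }).tap();
  await expect(page.getByText('Step 2')).toBeVisible(); // progressed without hover or keys
});
```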
Sub-skill 16 is the final regression pass, not the place where tests get created
By the time sub-skill 16 (final regression) runs, every prior sub-skill from 03 through 15 has already shipped its own tests. Sub-skill 16's job is not to write the test suite. Its job is to:
- Run the full suite end to end and report any reds.
- Walk every route declared in `tests/e2e/routes.ts`, asserting 2xx, no console errors, link resolution (no broken hrefs), and button reachability (every interactive element is keyboard-focusable and not occluded). A sketch of that crawl follows this list.
- Inspect any visual diffs that the per-feature runs flagged but didn't fully resolve.
- Run the production smoke pack (the flow specs, retargeted at the deployed URL) once the deploy lands.
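A sketch of the route crawl referenced above, assuming `tests/e2e/routes.ts` exports a flat array of paths and that a baseURL is configured so relative hrefs resolve.

```ts
// tests/e2e/crawl.spec.ts — walk every declared route: 2xx, clean console, resolvable internal links.
import { test, expect } from '@playwright/test';
import { routes } from './routes'; // assumed: export const routes: string[]

for (const route of routes) {
  test(`route ${route} renders cleanly`, async ({ page, request }) => {
    const consoleErrors: string[] = [];
    page.on('console', (msg) => {
      if (msg.type() === 'error') consoleErrors.push(msg.text());
    });

    const response = await page.goto(route);
    expect(response?.ok()).toBe(true); // 2xx
    expect(consoleErrors).toEqual([]); // no console errors

    // Link resolution: every internal href answers with a 2xx.
    const hrefs = await page.locator('a[href^="/"]').evaluateAll(
      (anchors) => anchors.map((a) => a.getAttribute('href')!),
    );
    for (const href of hrefs) {
      expect((await request.get(href)).ok()).toBe(true);
    }
  });
}
```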
If sub-skill 16 finds a regression, it is by definition a regression in something an earlier sub-skill claimed to ship correctly. The fix goes back to that sub-skill's territory; sub-skill 16 doesn't paper over it with a new test file.
The cost story
A typical full suite run on an MVP — unit + integration + e2e at five breakpoints in two themes plus axe + the route crawl — takes 2-5 minutes. A typical mismatched-only inspection takes the agent about 30 seconds per diff. Compare to the alternative, which is a human hand-walking every route at every breakpoint at every theme on every change. That alternative isn't "expensive." It's impossible. Nobody does it. The reason agent-owned visual regression is a genuine step-change is that it makes the impossible-by-hand suite cheap-by-machine.
The pixel-diff against a committed baseline is what makes re-runs cheap: the matched 95% of screenshots cost nothing to "re-verify" because nobody re-inspects them. The agent only pays attention where attention is needed.
What "tests are agent-owned" means for you
Three things, concretely:
- You don't write test cases. The agent does. You will not see a `describe(` block in the work the agent asks you to review.
- You don't run tests. The agent does, on every change, before showing you anything.
- You don't review screenshots that pixel-matched. Neither does the agent. Pixel-match is the contract.
The only signals you ever see from the testing system are red — something's wrong, the agent is working on it, you don't need to do anything — or green — ship. Anything else is a workflow drift, and it's worth pushing back on.