The QA Agent and Self-Test Loops

Tier 3 · Real Build · 8 min read

agents/qa_agent is the system’s Playwright-backed testing agent. It generates tests from PRDs, runs them against live or preview URLs, and reports visual diffs. When a test fails, it either escalates the failure or, via the self-test loop, tries to fix it autonomously.

What the QA agent is

The qa_agent was described in its projects/chat-interface-v2/ spec as: “AI QA agent — generate+run tests from PRDs.” Its CLI has five modes: run, parse, explore, mobile, and baseline.

It uses Playwright for browser automation. For each project, a qa.yaml file defines the test surface: routes to check, user stories to exercise, and visual baseline thresholds. The qa-ship.sh wrapper installs the qa_agent as a git pre-push hook, so armed projects cannot push without passing the QA gate.

Three live projects had qa.yaml shipped (CHANGELOG 2026-05-20): Hivehood (11/11 stories), nerve-center-v5 (16/16), and chat-cockpit (7/7). Two latent bugs in the QA runner were caught and fixed during that pass: qa-ship.sh was swallowing its exit code (so it always appeared to pass), and the runner’s domcontentloaded event was firing before NextAuth redirects had completed.

Both are good examples of the class of bug that only appears when you actually run the tests against a live product, not when you read the test code.
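
The exit-code bug is easy to reproduce in miniature. The sketch below is not the qa-ship.sh source; it is a Python illustration of the same failure shape, with hypothetical function names (run_gate, run_gate_swallowing) and a child process standing in for the real test runner.

```python
import subprocess
import sys


def run_gate(command: list[str]) -> int:
    """Run a QA command and hand its exit code back unchanged."""
    result = subprocess.run(command)
    return result.returncode


def run_gate_swallowing(command: list[str]) -> int:
    """The buggy shape: the command runs, but its status is discarded."""
    subprocess.run(command)
    return 0  # reports success no matter what the tests did


if __name__ == "__main__":
    # Hypothetical stand-in for a failing test run: a child that exits with 3.
    failing = [sys.executable, "-c", "import sys; sys.exit(3)"]
    print("swallowing wrapper:", run_gate_swallowing(failing))
    print("fixed wrapper:", run_gate(failing))
    # A pre-push hook only blocks the push if this status reaches the caller.
    sys.exit(run_gate(failing))
```

Reading either function in isolation, the swallowing version looks plausible. Only running it against a command that actually fails shows that the gate never closes.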

The qa-loop

The qa-loop (qa-loop.sh) is a bounded autonomous repair cycle for caught test failures. It shipped as part of the loop trio (CHANGELOG 2026-06-08) and follows this sequence:

test → fix → retest → cap-3 → escalate

The cap is non-negotiable: three fix attempts maximum. If the tests still fail after three iterations, the loop escalates to Telegram rather than continuing to spin. This prevents a QA loop from burning tokens on a failure that requires human insight.

The loop is opt-in for live targets. You do not run a repair loop against a production URL. It is designed for staging and preview environments where a failed fix can be torn down and tried again cleanly.
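
The test → fix → retest → cap-3 → escalate sequence can be sketched as a small function. This is an illustration of the control flow only, not the contents of qa-loop.sh: the callables run_tests, attempt_fix, and escalate are hypothetical placeholders for the real test run, the repair step, and the Telegram notification.

```python
from typing import Callable

MAX_FIX_ATTEMPTS = 3  # the cap described in the text


def qa_loop(
    run_tests: Callable[[], bool],
    attempt_fix: Callable[[int], None],
    escalate: Callable[[str], None],
    max_attempts: int = MAX_FIX_ATTEMPTS,
) -> bool:
    """Return True if the tests end up passing, False if the loop escalated."""
    if run_tests():
        return True  # nothing to repair
    for attempt in range(1, max_attempts + 1):
        attempt_fix(attempt)  # one bounded repair attempt
        if run_tests():  # retest after every fix
            return True
    # Cap reached: stop spending tokens and hand the failure to a human.
    escalate(f"tests still failing after {max_attempts} fix attempts")
    return False
```

Because the loop is a for over a fixed range, there is no code path that can make a fourth attempt. The cap is enforced by the structure rather than by a counter that a fix step could reset.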

Visual baselines

Playwright screenshot comparison catches regressions that unit tests miss. A layout change that shifts the grid at a specific viewport, a z-index change that buries a button, a CSS cascade order change that makes a 4-column grid render as 2-column (caught in CHANGELOG 2026-06-03, PR #112): none of these has a clean unit-testable description, but each one shows up in a visual diff.

The nc5 watchdog test that caught the CSS specificity bug is the clearest example: an !important modifier changed which rule won at viewport width ≥1800px. A “computed-style assertion in a real jsdom stylesheet replicating the prod emission order” was needed to catch it, not a class-string assertion. The class-string assertion had missed it for a full iteration cycle.

When you add visual baselines, two things matter: the baseline must be captured in a state you consider correct (not from a broken screenshot), and the diff threshold must be calibrated per-project (tight for static UI, looser for data-driven content that changes each run).
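
To make the per-project threshold concrete, here is a minimal sketch of a pixel-ratio comparison. It is not how qa_agent or Playwright computes diffs internally. Images are modelled as plain nested lists of RGB tuples, and the project names and threshold values in THRESHOLDS are invented for illustration.

```python
Pixel = tuple[int, int, int]
Image = list[list[Pixel]]

# Hypothetical per-project limits: the largest fraction of pixels allowed to differ.
THRESHOLDS = {
    "static-ui": 0.001,    # tight: the page should not change between runs
    "data-driven": 0.05,   # looser: content legitimately varies each run
}


def diff_ratio(baseline: Image, current: Image, tolerance: int = 0) -> float:
    """Fraction of pixels where any channel differs by more than tolerance."""
    same_shape = len(baseline) == len(current) and all(
        len(a) == len(b) for a, b in zip(baseline, current)
    )
    if not same_shape:
        raise ValueError("baseline and current screenshots differ in size")
    total = sum(len(row) for row in baseline)
    if total == 0:
        raise ValueError("cannot compare empty images")
    changed = 0
    for row_a, row_b in zip(baseline, current):
        for pixel_a, pixel_b in zip(row_a, row_b):
            if any(abs(a - b) > tolerance for a, b in zip(pixel_a, pixel_b)):
                changed += 1
    return changed / total


def passes_baseline(project: str, baseline: Image, current: Image) -> bool:
    """Compare against the project's own threshold, not a global one."""
    return diff_ratio(baseline, current) <= THRESHOLDS[project]
```

The same one-percent change passes under the looser threshold and fails under the tight one, which is the reason to calibrate per project instead of picking a single global number.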

The architect scope rule

The qa_agent fires on all products. The architect agent (agents/architect) fires only on nerve-center products. This rule exists because a mistake was made.

In April 2026, a batch of visual QA work was done with ad-hoc Playwright agents instead of routing through the existing agents/qa_agent. Wasted tokens: approximately $3. Lost opportunity: no visual baselines were fed into the canonical regression store.

The rule now: BEFORE spawning an ad-hoc agent for any task type, check ~/agent-system/agents/ for an existing specialist. Running bash ~/agent-system/scripts/list-agents.sh qa returns qa_agent. Use it.

The inverse mistake is also documented: the architect agent was invoked on non-nerve-center products (a client app, a standalone CLI), where its assumptions about NC5 wiring and cockpit conventions were wrong and caused incorrect gating. The scope rule prevents both errors:

QA work → always route through qa_agent
Architecture review → architect agent only for nerve-center products; industry-standard Playwright tests for everything else
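
The two routing lines above amount to a small decision function. The sketch below is one way to encode them; the task labels, the is_nerve_center flag, and the "playwright_tests" return value are hypothetical names chosen for the example, not identifiers from the agent system.

```python
def route(task: str, is_nerve_center: bool) -> str:
    """Pick the handler for a task under the scope rule."""
    if task == "qa":
        # QA work always goes through the existing specialist.
        return "qa_agent"
    if task == "architecture_review":
        # The architect agent assumes nerve-center conventions, so it is
        # only valid there; everything else gets standard Playwright tests.
        return "architect" if is_nerve_center else "playwright_tests"
    raise ValueError(f"no routing rule for task type: {task!r}")
```

Raising on an unknown task type, rather than falling back to an ad-hoc agent, mirrors the intent of the rule: the default is to look for a specialist first.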

How project lifecycle connects to QA

clawd-new-project.sh now installs a QA gate automatically on every new project:

Creates an ARCHITECT-BRIEF.md at project root
Scaffolds a qa/ directory
Installs a pre-push hook (dormant until armed with a qa.yaml)
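
A hook that is "dormant until armed" can be sketched as follows. This is an assumption-laden illustration, not the hook that clawd-new-project.sh installs: it assumes qa.yaml lives at the project root, and run_qa is a hypothetical callable that returns the QA run's exit code.

```python
from pathlib import Path
from typing import Callable


def is_armed(project_root: Path) -> bool:
    """Assumption: a project is armed once qa.yaml exists at its root."""
    return (project_root / "qa.yaml").is_file()


def pre_push(project_root: Path, run_qa: Callable[[], int]) -> int:
    """Return the exit code a pre-push hook would use (0 allows the push)."""
    if not is_armed(project_root):
        return 0  # dormant: the hook is installed but does not gate anything
    return run_qa()  # armed: the push succeeds only if the QA run does
```

Installing the hook on every new project while keeping it dormant means arming a project later is a single step: add the qa.yaml.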

clawd-finish-project.sh runs the architect signoff (for nerve-center projects) and the QA gate at close. Every project is governed from start to finish: the brief defines what success looks like, and the QA gate checks it before the project is marked done.

What 79 documented user stories look like

The most thorough QA pass in the CHANGELOG (2026-06-04) ran 79 user stories across 11 categories against nerve-center-v5.vercel.app. Results:

36/36 routes render OK
11/12 read-only acceptance criteria pass
Live chat round-trip passes (via a disposable session: spawn → PONG → deleted+archived, no live-session pollution)
3 real bugs surfaced: a 404 on /docs/index, a home token-spend showing $0 while the cockpit showed $968, and domain cards showing stale ages despite the system reporting them active

The three bugs are characteristic: two are data-source divergence (different pages reading different sources for the same number), one is a missing route. All three were found by testing the actual product against the stated acceptance criteria, not by reading the code.
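
Data-source divergence is the kind of bug a cross-page consistency check can flag. The sketch below is hypothetical and not part of the qa_agent: it takes values already scraped from each page and reports any metric that two pages disagree on. The metric names are invented; the $0 and $968 figures are the ones from the pass above.

```python
def find_divergent_metrics(
    readings: dict[str, dict[str, float]], tolerance: float = 0.0
) -> dict[str, dict[str, float]]:
    """readings maps page name -> {metric name: value shown on that page}.

    Returns only the metrics that appear on more than one page with values
    further apart than tolerance, keyed by metric, then by page.
    """
    by_metric: dict[str, dict[str, float]] = {}
    for page, metrics in readings.items():
        for name, value in metrics.items():
            by_metric.setdefault(name, {})[page] = value
    return {
        name: pages
        for name, pages in by_metric.items()
        if len(pages) > 1 and max(pages.values()) - min(pages.values()) > tolerance
    }


if __name__ == "__main__":
    scraped = {
        "home": {"token_spend_usd": 0.0},
        "cockpit": {"token_spend_usd": 968.0, "active_sessions": 4.0},
    }
    print(find_divergent_metrics(scraped))
```

A check like this still depends on reading the rendered pages of the running product, which is the point the three bugs illustrate.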

Next: The self-healer takes QA failures a step further. The Self-Healer and Evolution Engine covers outcome tracking, circuit-breakers, and the scout-propose-build loop.