The Self-Healing Loop Pattern

Tier 2 · Building · 9 min read

Before this, read:

Worktree isolation — each build agent in the loop needs its own worktree
The loops doctrine — the scout is a cron; the watchers are part of the approval flow
Ship-to-prod — the build agents in the loop follow the ship-to-prod flow

In June 2026, JD posted on LinkedIn: “this morning my agent system found 7 bugs in itself. I approved the fixes. Seven build agents are shipping them.” The claim is accurate. The infrastructure that made it happen is the subject of this article.

The self-healing loop is not magic. It’s a specific pipeline of four cooperating agents, each with a narrow job. Understanding the pipeline is what lets you build it.

The four agents in the loop

1. The scout (agents/evolution/scout.py)

Runs twice a day (10am and 4pm, per CHANGELOG 2026-06-08). Its job: read the system’s error logs, cron failure logs, and test outputs; identify recurring failures; write ranked root-cause proposals to agents/evolution/proposals.py.

Each proposal has:

A root cause statement
A specific fix description
A confidence score
A risk assessment (reversible, irreversible, blast-radius)

The scout does not build anything. It observes and proposes.
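The four proposal fields listed above could be modeled as a small record. This is a minimal sketch, not the actual schema in agents/evolution: the field names, the 0-to-1 confidence range, the integer blast-radius estimate, and ranking by confidence alone are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Risk(Enum):
    # Hypothetical labels for the reversibility part of the risk assessment.
    REVERSIBLE = "reversible"
    IRREVERSIBLE = "irreversible"


@dataclass(frozen=True)
class Proposal:
    root_cause: str    # what the scout believes is actually wrong
    fix: str           # the specific change it proposes
    confidence: float  # assumed to lie in [0.0, 1.0]
    risk: Risk
    blast_radius: int  # assumed: estimated number of files the fix touches


def rank(proposals: list[Proposal]) -> list[Proposal]:
    """Order proposals so the highest-confidence ones come first."""
    return sorted(proposals, key=lambda p: p.confidence, reverse=True)


proposals = [
    Proposal("hand-rolled HTTP calls skip the fallback", "migrate to shared client",
             0.7, Risk.REVERSIBLE, 9),
    Proposal("cron entry missing after refactor", "re-register the daily cron",
             0.9, Risk.REVERSIBLE, 1),
]
ranked = rank(proposals)
```

Keeping the record immutable (frozen=True) means a proposal cannot be altered between the scout writing it and the reviewer reading it, at least within one process.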

2. The approval gate (human — JD via Telegram)

JD reviews the proposals and approves or rejects each one. The agents/evolution/approve.py module handles this: proposals are sent to JD’s Telegram as a formatted list; JD replies with approve/reject per proposal.

This step is non-negotiable. Build agents don’t start until JD approves. The scout may be right about the bug and wrong about the fix; JD’s review catches that. Reversible fixes (adding a test, fixing a log format) are easy approvals. Irreversible fixes (schema migrations, behavioral changes to live systems) get more scrutiny.
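The approve/reject exchange can be sketched without touching any messaging API. The message layout and the reply grammar ("1 approve, 2 reject") below are assumptions; the article does not show what approve.py actually sends or accepts. The one design point worth copying is the default: anything not explicitly approved stays unapproved.

```python
def format_proposals(fixes: list[str]) -> str:
    """Render proposals as a numbered list suitable for a chat message."""
    return "\n".join(f"{i}. {fix}" for i, fix in enumerate(fixes, start=1))


def parse_reply(reply: str, count: int) -> dict[int, bool]:
    """Parse a reply such as '1 approve, 2 reject' into {number: approved}.

    Every proposal starts as not approved, so silence or a malformed
    reply never starts a build agent.
    """
    decisions = {i: False for i in range(1, count + 1)}
    for part in reply.lower().split(","):
        tokens = part.split()
        if len(tokens) != 2 or not tokens[0].isdigit():
            continue  # ignore anything that is not '<number> <verdict>'
        number, verdict = int(tokens[0]), tokens[1]
        if number in decisions and verdict == "approve":
            decisions[number] = True
    return decisions
```

For example, parse_reply("1 approve, 2 reject", 3) leaves proposals 2 and 3 unapproved: one was rejected and the other was never mentioned.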

3. Build agents (one per approved proposal, worktree-isolated)

For each approved proposal, the CEO work-queue loop spawns a worktree-isolated build agent with the specific objective from the proposal. Each agent:

Creates its own git worktree
Implements the fix per the proposal’s description
Writes or updates the regression guard (a test, a CI gate, or a periodic sweep)
Commits to its own feature branch
Reports pass/fail back to the CEO

The agents run in parallel. Seven proposals approved → seven build agents, each in its own isolated worktree, each working the same morning.
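The spawn step could look roughly like the sketch below. It relies on the real git worktree add -b command; the branch naming, the worktree location, and the agent callable (a stand-in for whatever actually implements the fix and reports pass/fail) are hypothetical. Worktrees are created one at a time here as a conservative choice, and only the agent work runs in parallel.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable


def worktree_command(repo: Path, proposal_id: str, base: str = "main") -> list[str]:
    """Build the git command that creates a worktree on a new feature branch."""
    branch = f"fix/{proposal_id}"                      # hypothetical branch naming
    path = repo.parent / f"{repo.name}-{proposal_id}"  # hypothetical worktree location
    return ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(path), base]


def run_all(repo: Path, proposal_ids: list[str],
            agent: Callable[[str], bool]) -> dict[str, bool]:
    """Create one worktree per approved proposal, then run the agents in parallel."""
    for pid in proposal_ids:
        # Serial creation; raises CalledProcessError if git fails.
        subprocess.run(worktree_command(repo, pid), check=True)
    with ThreadPoolExecutor(max_workers=len(proposal_ids) or 1) as pool:
        results = pool.map(agent, proposal_ids)
        return dict(zip(proposal_ids, results))
```

The returned mapping of proposal id to pass/fail corresponds to the report each agent sends back to the CEO loop.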

4. The regression guard

Every fix includes a guard. Not as an afterthought — the proposal spec requires it. The guard is part of what makes this a self-healing loop rather than a self-patching one. A patch without a guard means the same bug is just one bad edit away. A patch with a guard means the CI suite will catch it if it comes back.

What counts as a guard:

A new pytest test that fails if the bug reappears
A new CI gate (grep-gate, import-smoke extension, registry check)
A new periodic sweep that alerts on the bug class
A new assertion in an existing test that covers the new behavior

A build agent that doesn’t add a guard has shipped an incomplete fix.
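Of the guard types listed above, a grep-gate is the simplest to sketch: scan the source tree for a pattern that should only appear in one approved place, and fail CI if it shows up anywhere else. The pattern and the allowlisted file name below are illustrative assumptions, not the project's actual gate.

```python
import re
from pathlib import Path

# Hypothetical rule: direct references to the OpenRouter API are only
# allowed inside the shared client module.
FORBIDDEN = re.compile(r"openrouter\.ai/api")
ALLOWED = {"llm_client.py"}


def violations(sources: dict[str, str]) -> list[str]:
    """Return 'file:line' for each forbidden match outside the allowlist."""
    found = []
    for name, text in sorted(sources.items()):
        if Path(name).name in ALLOWED:
            continue  # the shared client is the one place the pattern may live
        for lineno, line in enumerate(text.splitlines(), start=1):
            if FORBIDDEN.search(line):
                found.append(f"{name}:{lineno}")
    return found
```

In CI, sources would be built by reading the repository's files, and a pytest test would assert that violations(sources) is empty. That assertion is what turns a one-time migration into a guard.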

What the “7 bugs” looked like

From the CHANGELOG 2026-06-07 → 2026-06-08, the evolution loop shipped fixes for:

Plaud retry/backoff — voice notes that failed to process were silently dropped; fix added exponential retry with a 72-hour aging alert
OpenRouter handroll migration — 9 files had hand-rolled OpenRouter HTTP calls that didn’t fall back to Anthropic on 402s; fix migrated them all to the shared llm_client module (and the CI grep-gate now blocks new handrolls)
Scout cron timing — scout was running at wrong times after a cron edit; retimed to 10:00/16:00
Socrates PBT evolution step orphan — the evolution step for Socrates’ population-based training had been un-cron’d after a refactor; fix registered the daily cron entry
Self-healer outcome tracking — the self-healer logged what it proposed but not what the build agent actually shipped; fix added outcome tracking
Drift watchdog — no watchdog for the NC5 checkout drifting from main; fix added nc5-watchdog.sh
Blast-radius limiter — build agents had no cap on how many system files a single fix could touch; fix added a limit and a check
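To make the first item concrete, exponential retry with an aging alert might be shaped like the sketch below. Only the 72-hour figure comes from the changelog entry; the base delay, the cap, and the function names are assumptions.

```python
from datetime import datetime, timedelta

BASE_DELAY = timedelta(minutes=1)   # assumed starting delay
MAX_DELAY = timedelta(hours=6)      # assumed cap on any single wait
AGING_LIMIT = timedelta(hours=72)   # the 72-hour aging threshold


def next_delay(attempt: int) -> timedelta:
    """Delay before retry number `attempt` (0-based): base * 2**attempt, capped."""
    return min(BASE_DELAY * (2 ** attempt), MAX_DELAY)


def needs_aging_alert(first_failed_at: datetime, now: datetime) -> bool:
    """True once an item has been failing for longer than the aging limit."""
    return now - first_failed_at > AGING_LIMIT
```

The alert is what ends the silent drop: an item that keeps failing is retried at growing intervals, and once it has been stuck past the limit, someone is told.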

Seven proposals, seven build agents, seven shipped fixes with regression guards, same day they were proposed. That’s the loop working as designed.

Building a minimal version

You don’t need all of this on day one. The minimum viable self-healing loop has three components:

A log reader that writes proposals. A script that reads your error logs, identifies recurring failures (anything that appears more than 3 times in the last 24 hours of logs), and writes them to a proposals file with a brief description. No scoring and no risk assessment, just “here are things that broke.”
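A minimal sketch of that log reader, assuming the logs have already been parsed into (timestamp, message) pairs; the parsing itself depends on your log format and is left out. The function names and the output line format are illustrative.

```python
from collections import Counter
from datetime import datetime, timedelta
from pathlib import Path


def recurring_failures(entries: list[tuple[datetime, str]], now: datetime,
                       window: timedelta = timedelta(hours=24),
                       threshold: int = 3) -> list[tuple[str, int]]:
    """Return (message, count) for messages seen more than `threshold` times in the window."""
    cutoff = now - window
    counts = Counter(msg for ts, msg in entries if ts >= cutoff)
    return [(msg, n) for msg, n in counts.most_common() if n > threshold]


def proposal_lines(failures: list[tuple[str, int]]) -> list[str]:
    """One brief line per recurring failure, most frequent first."""
    return [f"{n}x in window: {msg}" for msg, n in failures]


def write_proposals(failures: list[tuple[str, int]], path: Path) -> None:
    """Write the proposals file that the human reviewer will read."""
    path.write_text("\n".join(proposal_lines(failures)) + "\n", encoding="utf-8")
```

Grouping by exact message is crude: messages that embed ids or timestamps will never repeat, so in practice you would normalize them first.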

A human review step. Don’t skip this. A fully autonomous “find bug → fix bug” loop that never involves a human will eventually auto-fix something in a way that makes things worse. The approval step is cheap (30 seconds to scan 7 proposals) and catches the edge cases.

Build agents that add tests. The fix is the easy part. The test is what makes it self-healing rather than self-patching. Every build agent’s prompt should include: “After implementing the fix, add a test that would have caught this bug. The CI suite must pass before you report done.”
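One way to make sure that instruction is never forgotten is to append it in code rather than by hand. A small sketch follows; the prompt layout and function name are assumptions, and only the quoted instruction comes from the text above.

```python
GUARD_INSTRUCTION = (
    "After implementing the fix, add a test that would have caught this bug. "
    "The CI suite must pass before you report done."
)


def build_prompt(root_cause: str, fix: str) -> str:
    """Assemble a build agent prompt that always ends with the guard requirement."""
    return (
        f"Root cause: {root_cause}\n"
        f"Fix to implement: {fix}\n\n"
        f"{GUARD_INSTRUCTION}"
    )
```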

Start there. Add confidence scoring, risk assessment, and blast-radius limits when you have a track record of proposals to calibrate against.
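When you do get to blast-radius limits, the check itself can be very small. The sketch below assumes the cap is a simple count of distinct files touched by one fix; the default of 5 is a placeholder, not a recommendation from the source.

```python
def check_blast_radius(changed_files: list[str], limit: int = 5) -> None:
    """Raise if a single fix touches more distinct files than the cap allows."""
    unique = sorted(set(changed_files))
    if len(unique) > limit:
        raise ValueError(
            f"fix touches {len(unique)} files, limit is {limit}: {unique}"
        )
```

The list of changed files could come from something like git diff --name-only against the base branch, run inside the build agent's worktree before it reports done.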

The honest caveat

The self-healing loop is real and it finds real bugs. It is also a system that proposes changes to itself, which means it can propose changes that are incorrect or incomplete, or that interact badly with other parts of the system. The approval gate exists precisely because the scout’s confidence scores are not infallible.

The loop made the system better over six weekends of proposals. It also, on one occasion, proposed a fix that would have changed a behavioral contract and broken downstream consumers. JD rejected that proposal in review. The loop caught 7 bugs; it also needed to be told “no” once.

That ratio is fine. That’s what the human gate is for.

Next: When to scaffold a full project with a WORKPLAN and CHANGELOG vs. adding a line to a domain task list. The decision rule. Projects vs. domain todos.