The Self-Healer and Evolution Engine

Tier 3 · Real Build · 9 min read

The self-healer and evolution engine are two related but distinct pieces. The self-healer detects and fixes known failure classes automatically — lock sweeps, log rotation, process restarts. The evolution engine is broader: it reads the system’s own logs, identifies root causes, proposes fixes ranked by impact, and runs isolated build agents to ship those fixes after JD approves.

Together they form the loop that JD described on LinkedIn: “This morning my agent system found 7 bugs in itself… seven build agents are shipping the fixes.” That post was accurate. The 7-bug run is confirmed in CHANGELOG 2026-06-08.

The self-healer

agents/evolution/self_healer_autofix.py runs on a cron and handles a safe whitelist of automatically-fixable failures:

Stale lock sweeps
Log rotation when logs exceed 50 MB
com.jd.claude-telegram-daemon launchd kickstart
Telegram tmux session recovery

Everything outside that whitelist gets escalated to Telegram as “needs attention.” The whitelist is intentionally narrow — automating a repair that could mask a deeper problem is worse than surfacing the failure.
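The whitelist-or-escalate rule can be sketched as a lookup table. This is a minimal illustration, not the contents of self_healer_autofix.py: the failure-class keys, the handler functions, and the `escalations` list (a stand-in for the Telegram message) are all hypothetical names.

```python
from typing import Callable, Dict, List

# Hypothetical handlers; each one mirrors an entry on the whitelist above.
def sweep_stale_locks() -> str:
    return "swept stale locks"

def rotate_logs() -> str:
    return "rotated logs over 50 MB"

def kickstart_daemon() -> str:
    return "kickstarted com.jd.claude-telegram-daemon"

def recover_tmux_session() -> str:
    return "recovered Telegram tmux session"

# Assumed failure-class names mapped to their safe fixes.
SAFE_FIXES: Dict[str, Callable[[], str]] = {
    "stale_lock": sweep_stale_locks,
    "log_too_large": rotate_logs,
    "telegram_daemon_down": kickstart_daemon,
    "telegram_tmux_missing": recover_tmux_session,
}

# Stand-in for sending a "needs attention" message to Telegram.
escalations: List[str] = []

def handle_failure(failure_class: str) -> str:
    """Apply a whitelisted fix, or escalate anything not on the whitelist."""
    fix = SAFE_FIXES.get(failure_class)
    if fix is None:
        escalations.append(failure_class)
        return "escalated"
    return fix()
```

The default path is escalation: a new failure class does nothing automatic until someone deliberately adds it to the table.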

The hardening pass (CHANGELOG 2026-06-08 08:21) added three layers on top of the basic auto-fix:

Outcome tracking and circuit-breaker. After applying a fix, the self-healer verifies the fix worked. If the same failure recurs three times after the same fix, the circuit-breaker trips: no more auto-repair attempts, escalate immediately. This prevents the self-healer from spinning on a failure it cannot actually fix.
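A rough sketch of that behavior, assuming the breaker counts recurrences per (failure, fix) pair and never resets on its own. The article does not describe the reset policy or the real class layout, so `CircuitBreaker` and `attempt_repair` are illustrative names only.

```python
from collections import Counter
from typing import Callable

MAX_RECURRENCES = 3  # from the text: trip on the third recurrence

class CircuitBreaker:
    def __init__(self, limit: int = MAX_RECURRENCES) -> None:
        self.limit = limit
        # (failure, fix) -> times the failure was still present after the fix
        self.recurrences: Counter = Counter()

    def is_open(self, failure: str, fix: str) -> bool:
        return self.recurrences[(failure, fix)] >= self.limit

    def record_recurrence(self, failure: str, fix: str) -> None:
        self.recurrences[(failure, fix)] += 1

def attempt_repair(
    breaker: CircuitBreaker,
    failure: str,
    fix: str,
    apply_fix: Callable[[], None],
    verify: Callable[[], bool],
) -> str:
    """Apply a fix, verify the outcome, and stop trying once the breaker trips."""
    if breaker.is_open(failure, fix):
        return "escalate"  # tripped earlier: no further auto-repair
    apply_fix()
    if verify():
        return "fixed"
    breaker.record_recurrence(failure, fix)
    return "escalate" if breaker.is_open(failure, fix) else "retry_later"
```

Once the breaker is open, `apply_fix` is never called again for that pair, which is what keeps the self-healer from spinning.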

3-layer drift watchdog. Checks at three levels — input, decision, and output — on a cron at :23 and :53 past each hour. Input drift: is the data coming in as expected? Decision drift: are the right agents making decisions? Output drift: are the outputs landing where they should? Any layer that drifts triggers an escalation.
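As an illustration, the three layers can be modeled as independent checks that each answer "healthy or not." The check functions here are placeholders, and treating a crashing check as drift is an assumption rather than something the article states.

```python
from typing import Callable, Dict, List

# In standard crontab syntax, a schedule of "23,53 * * * *" fires at
# :23 and :53 past every hour.
Check = Callable[[], bool]  # returns True when the layer looks healthy

def run_watchdog(checks: Dict[str, Check]) -> List[str]:
    """Run every layer's check and return the names of layers that drifted."""
    drifted: List[str] = []
    for layer, check in checks.items():
        try:
            healthy = check()
        except Exception:
            healthy = False  # assumption: a check that crashes counts as drift
        if not healthy:
            drifted.append(layer)
    return drifted
```

Every layer runs even if an earlier one fails, so a single run reports all drifting layers, and any non-empty result would trigger an escalation.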

Blast-radius limiter. The self-healer maintains a list of protected resources: anything that could affect production data, JD’s real accounts, or live sessions. If a proposed action touches a protected resource, the limiter blocks it and escalates. Resources the limiter doesn’t recognize are treated the same way: they get flagged rather than acted on.
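One way to express the limiter is as a default-deny check. The resource names and the explicit known-safe set below are assumptions made for illustration; the article only says that protected and unrecognized resources are both blocked.

```python
from enum import Enum
from typing import FrozenSet, Iterable

class Verdict(Enum):
    ALLOW = "allow"
    BLOCK_PROTECTED = "block_protected"
    BLOCK_UNKNOWN = "block_unknown"

# Hypothetical resource names.
PROTECTED: FrozenSet[str] = frozenset({"prod_db", "jd_accounts", "live_sessions"})
KNOWN_SAFE: FrozenSet[str] = frozenset({"lock_files", "log_files", "telegram_tmux"})

def check_action(
    resources: Iterable[str],
    protected: FrozenSet[str] = PROTECTED,
    known_safe: FrozenSet[str] = KNOWN_SAFE,
) -> Verdict:
    """Allow an action only if every resource it touches is known to be safe."""
    unknown_seen = False
    for resource in resources:
        if resource in protected:
            return Verdict.BLOCK_PROTECTED  # protected always wins
        if resource not in known_safe:
            unknown_seen = True
    return Verdict.BLOCK_UNKNOWN if unknown_seen else Verdict.ALLOW
```

The key design choice is that "unknown" is not a list anyone maintains. It is whatever falls outside both sets, so a resource added to the system later is blocked until it is classified.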

After the hardening pass, 48 tests were passing and the crontab had grown to 150 jobs.

The evolution engine

agents/evolution is the broader system. Where the self-healer handles known failure classes, the evolution engine handles discovery: which failures does the system not know about yet?

The scout runs twice daily (9:23 AM and 4:23 PM), reads the system’s own logs and error output, and writes ranked root-cause proposals to a file. The proposals include severity, blast radius, and a suggested fix. JD reviews them from Telegram and approves the ones worth building.
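A proposal record and its ranking might look like the sketch below. The field names follow the paragraph above, but the 1-to-5 scales and the sort order are assumptions. The article says proposals are ranked by impact without giving the formula.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Proposal:
    root_cause: str
    severity: int      # assumed scale: 1 (minor) to 5 (critical)
    blast_radius: int  # assumed scale: 1 (one file) to 5 (system-wide)
    suggested_fix: str

def rank(proposals: List[Proposal]) -> List[Proposal]:
    """Highest severity first; ties go to the smaller blast radius."""
    return sorted(proposals, key=lambda p: (-p.severity, p.blast_radius))
```

Breaking ties toward the smaller blast radius puts the fixes that are cheapest to ship safely at the top of the approval list.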

After approval, an isolated build agent (worktree-isolated, one agent per proposal) ships the fix with regression guards. The scout’s output from the 2026-06-08 run: 7 proposals, 7 approved, 7 build agents shipped fixes the same day.

Wave 2: what the evolution engine fixed about itself

Evolution wave 2 (CHANGELOG 2026-06-08 07:30) is a useful case study because the evolution engine fixed its own infrastructure:

Plaud retry/backoff — added retry logic for Plaud API failures
72-hour aging alerts — new recordings that sit unprocessed for 72 hours now trigger an alert
Arrival-report digests — a ledger of processed recordings was seeded with a 20-recording catchup backfill
OpenRouter handroll migration — 9 files that had hand-rolled OpenRouter calls were migrated to the shared llm_client with Anthropic fallback (the old pattern would fail silently when OpenRouter credits ran out)
New CI gate — openrouter-literal gate added to catch any new hand-rolled OpenRouter calls at PR time
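The fallback behavior behind the OpenRouter migration can be illustrated without any real API calls. This sketch is not the actual llm_client: the providers are plain callables standing in for OpenRouter and Anthropic, and `complete` and `AllProvidersFailed` are hypothetical names.

```python
from typing import Callable, List, Sequence, Tuple

Provider = Callable[[str], str]  # takes a prompt, returns text, or raises

class AllProvidersFailed(RuntimeError):
    """Raised when every provider in the chain has failed."""

def complete(prompt: str, providers: Sequence[Tuple[str, Provider]]) -> str:
    """Try each provider in order; fail loudly if none of them succeeds."""
    errors: List[str] = []
    for name, call in providers:
        try:
            return call(prompt)
        except Exception as exc:
            errors.append(f"{name}: {exc}")  # keep the reason for the report
    raise AllProvidersFailed("; ".join(errors))
```

The final `raise` is the part that addresses the silent-failure problem: if the whole chain fails, the caller gets an exception that names each provider and why it failed.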

All 7 evolution proposals were built the same day they were proposed. 113 new tests landed with wave 2.

The CI gate is the regression guard for the OpenRouter migration. It is not enough to migrate the existing files — future code could re-introduce the pattern. The gate blocks it at PR time.
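A gate of this kind can be as small as a script that scans the tree and fails when it finds the forbidden literal outside the shared client. The pattern and the allowed file name below are assumptions. The article names the gate but does not say what it matches.

```python
import re
from pathlib import Path
from typing import List

# Assumed pattern; the real gate's matching rule is not given in the article.
OPENROUTER_LITERAL = re.compile(r"openrouter\.ai", re.IGNORECASE)
# Assumed: the shared client is the only file allowed to contain the literal.
ALLOWED_FILES = {"llm_client.py"}

def find_violations(root: Path) -> List[str]:
    """Return 'relative/path.py:lineno' for each hand-rolled OpenRouter literal."""
    hits: List[str] = []
    for path in sorted(root.rglob("*.py")):
        if path.name in ALLOWED_FILES:
            continue
        lines = path.read_text(encoding="utf-8").splitlines()
        for lineno, line in enumerate(lines, start=1):
            if OPENROUTER_LITERAL.search(line):
                hits.append(f"{path.relative_to(root).as_posix()}:{lineno}")
    return hits
```

In CI, the job would exit non-zero whenever `find_violations` returns a non-empty list, which blocks the PR.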

The earlier self-healer noise problem

The first version of the self-healer had a noise problem: it was alerting about the same 5 issues every 4 hours, even after they were resolved or acknowledged. The fix was per-issue 24-hour suppression via ~/agent-system/state/self-healer-notified.json.
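Per-issue suppression needs only a small state file. The sketch below assumes the JSON maps each issue id to the epoch time of its last notification. The real schema of self-healer-notified.json is not shown in the article, and `should_notify` is a hypothetical name.

```python
import json
import time
from pathlib import Path
from typing import Optional

SUPPRESS_SECONDS = 24 * 60 * 60  # 24-hour window from the text

def should_notify(issue_id: str, state_path: Path, now: Optional[float] = None) -> bool:
    """Return True (and record the time) unless this issue alerted within 24 hours."""
    now = time.time() if now is None else now
    try:
        state = json.loads(state_path.read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        state = {}  # a missing or corrupt state file means nothing is suppressed
    last = state.get(issue_id)
    if last is not None and now - last < SUPPRESS_SECONDS:
        return False
    state[issue_id] = now
    state_path.write_text(json.dumps(state), encoding="utf-8")
    return True
```

Suppression is keyed per issue, so a new problem still alerts immediately even while five known ones are muted.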

It also had a false positive: it was reading a dead telegram-bot.log (0 bytes since April 5) to monitor bot activity, which showed no activity and triggered alerts. The fix was to read telegram-inbound-router.log instead, and downgrade the metric severity from warning to info.

Both problems are structural: a monitoring system that alerts about things the operator already knows about trains the operator to ignore it. Suppression windows and correct log sources are not optional polish — they are load-bearing.

The proposal → approve → build flow in practice

Evolution scout (cron 9:23 + 16:23)
  → reads ~/agent-system/logs/
  → writes ranked proposals to state/evolution-proposals.yaml

JD reviews via Telegram
  → approves subset (one tap per proposal)

CEO work-queue loop picks up approved proposals
  → each proposal becomes a worktree-isolated build task
  → build agent ships fix + regression guard + tests
  → commits to a branch, opens PR

QA gate on PR
  → auto-merge if green
  → escalate if not

The CEO work-queue loop (part of the loop trio, CHANGELOG 2026-06-08) is what makes the approved proposals actually execute without requiring a live session. Before the loop, an approved proposal sat in the queue until JD started a Claude session and ran it manually. The loop runs every 2 hours (8 AM – 8 PM) and drains the approved queue autonomously.
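The drain step of that loop can be sketched as follows. The status names and the `build` callable, which stands in for the whole worktree build, PR, and QA gate sequence, are assumptions for illustration and not the actual queue format.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class QueueItem:
    proposal_id: str
    # Assumed states: "proposed", "approved", "building", "merged", "escalated"
    status: str

def drain_approved(queue: List[QueueItem], build: Callable[[str], bool]) -> List[str]:
    """Run one build per approved proposal; return the ids that were attempted."""
    attempted: List[str] = []
    for item in queue:
        if item.status != "approved":
            continue  # unapproved proposals are never built
        item.status = "building"
        green = build(item.proposal_id)  # True when the QA gate passes
        item.status = "merged" if green else "escalated"
        attempted.append(item.proposal_id)
    return attempted
```

Because every attempted item leaves the "approved" state, running the loop again two hours later does not rebuild anything that was already handled.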

Numbers from the CHANGELOG

7 real bugs found and fixed in one day via the evolution loop (confirmed CHANGELOG 2026-06-08)
48 tests for the self-healer hardening pass alone
113 new tests from evolution wave 2
Circuit-breaker trips after the same failure recurs 3 times following the same fix
Blast-radius limiter blocks any action on protected or unknown resources
24-hour per-issue suppression on self-healer alerts

Next: Synthetic Users — agents that simulate users to exercise a product before real humans see it.