Root-Cause-First as a Build Process
Before this, read:
- Prompting the CLI well — scope discipline applies to bug-fixing too
The bandaid is fast. Fix this one file, close the ticket, move on. The problem with bandaids is that the same bug class comes back — slightly different location, same underlying cause — and you patch it again. And again. At some point you’ve patched twelve instances of the same process failure instead of fixing the process once.
Root-cause-first is the alternative. It’s slower on any single bug. It’s faster across the lifespan of the system.
The 5 steps, in order
Section titled “The 5 steps, in order”-
Diagnose the root cause. Not “what broke” but “why did this system allow this to break?” Ask why repeatedly until you hit a process gap, not just a code line. The surface symptom is the last step, not the first.
-
Fix the process. The script, the cron, the schema, the convention, the hook — whatever generated this class of bug must stop generating it. One change that prevents recurrence.
-
Backfill existing instances. Run one automated pass over all the existing cases of this bug. Not a list of follow-up tickets. One sweep, now.
-
Add the regression guard. A check, a CI gate, a watchdog, a periodic sweep — something that will catch the bug class if it appears again and surface it before it silently breaks for a week.
-
Patch the surface symptom that JD originally noticed. Often redundant after step 3. That’s fine — by design.
This order is non-negotiable. Steps 1–4 before step 5.
Real examples from the changelog
Section titled “Real examples from the changelog”The PATH-in-cron bug
Section titled “The PATH-in-cron bug”Symptom: Domain heartbeats and the CRM sync stopped running correctly. Nothing in the logs.
Surface patch: Restart the heartbeat scripts manually.
Root cause (actual): The cron environment doesn’t inherit the interactive PATH. /opt/homebrew/bin/python3.12 isn’t in the cron PATH, so python3.12 resolves to the wrong interpreter or fails entirely. Two crons were found running under Python 3.14 (a non-standard install) because of this. The heartbeats worked fine in an interactive terminal and failed silently in cron.
Process fix: Every cron that invokes Python gets a standard prelude:
cd ~/agent-systemexport PATH="/opt/homebrew/bin:/usr/local/bin:/usr/bin:/bin"export PYTHONPATH="$HOME/agent-system"source ~/agent-system/.env 2>/dev/nullBackfill: One audit pass across all active cron jobs; the prelude was added to every agents.* cron that was missing it.
Regression guard: domain-integrity-check.sh (daily at 09:17) now flags any agents.* cron line missing the cd/PYTHONPATH prelude.
Symptom patch: Manually restarted the affected heartbeats — now redundant, since the process fix covered them.
The WORKPLAN “status” drift bug
Section titled “The WORKPLAN “status” drift bug”Symptom: Task T0.1 showed as “open” in the WORKPLAN but the CHANGELOG showed it as shipped.
Surface patch: Edit T0.1 manually.
Root cause (actual): clawd-log.sh only wrote to the CHANGELOG. It did not update the WORKPLAN Status: marker. The pre-injection that feeds JD’s morning briefing was also capped at 10 bullets, missing anything after the 10th entry.
Process fix: Patched clawd-log.sh to accept --task ID and write the status marker; bumped the pre-injection cap to 30.
Backfill: One automated sweep of existing CHANGELOG entries → WORKPLAN markers.
Regression guard: A nightly reconcile script that fails if WORKPLAN items are marked open but their IDs appear in the last 30 days of CHANGELOG.
Symptom patch: T0.1 updated — covered by backfill.
The crontab wipe
Section titled “The crontab wipe”Symptom: All 148 cron jobs vanished.
Root cause (actual): A crontab -l | python -c <inline> | crontab - editing pattern where the inline Python had a SyntaxError. Python exited with an error, producing empty stdout. crontab - with empty stdin replaces all jobs with nothing.
Process fix: “Never pipe inline expressions to crontab -.” Write to a temp file, validate, then install. This is now a hard rule, not a guideline.
Backfill: Restored from the most recent backup + re-applied all known deltas.
Regression guard: A daily crontab auto-snapshot cron (added the same day) bounds future loss to under 24 hours.
Symptom patch: All 148 jobs restored.
When the bandaid is explicitly chosen
Section titled “When the bandaid is explicitly chosen”Sometimes the right decision is: “bandaid this one, root-cause fix later.” That’s a valid call when the root-cause fix is a 30-minute build and the situation is blocking something more important.
The rule when this happens: open a loop. The surface symptom gets the bandaid; the root-cause fix gets an open-loop entry with an owner and a timeline. If the loop isn’t opened, the bandaid is a permanent fixture — and in six months you’ll forget why that manual workaround exists.
Push back on the bandaid when you can. The framing: “bandaid is 5 minutes, root-cause fix is 30 minutes, bandaid will recur in a different form next month. Recommend the 30-minute fix.” Then do it, unless JD explicitly says to defer.
What this looks like in an agent prompt
Section titled “What this looks like in an agent prompt”When spawning a build agent to fix a bug, include the root-cause-first doctrine explicitly:
Fix the bug. Follow the root-cause-first process:1. Diagnose why the system allows this class of bug.2. Fix the process that generates it.3. Backfill existing instances.4. Add a regression guard.5. Then patch the surface symptom (often redundant after step 3).
Do NOT stop at step 5. If you can only do the surface patch in this session, open a loop for the root-cause fix and say so in your report.Without this instruction, most agents default to step 5. The doctrine has to be in the prompt.
Next: The PR → CI → merge → deploy → smoke flow — and why JD reviews in prod, not in pull requests. Ship-to-prod workflow.