Root-Cause-First as a Build Process

Tier 2 · Building · 8 min read

Before this, read:

Prompting the CLI well — scope discipline applies to bug-fixing too

The bandaid is fast. Fix this one file, close the ticket, move on. The problem with bandaids is that the same bug class comes back — slightly different location, same underlying cause — and you patch it again. And again. At some point you’ve patched twelve instances of the same process failure instead of fixing the process once.

Root-cause-first is the alternative. It’s slower on any single bug. It’s faster across the lifespan of the system.

The 5 steps, in order

1. Diagnose the root cause. Not “what broke” but “why did this system allow this to break?” Ask why repeatedly until you hit a process gap, not just a code line. The surface symptom is the last step, not the first.
2. Fix the process. The script, the cron, the schema, the convention, the hook: whatever generated this class of bug must stop generating it. One change that prevents recurrence.
3. Backfill existing instances. Run one automated pass over all the existing cases of this bug. Not a list of follow-up tickets. One sweep, now.
4. Add the regression guard. A check, a CI gate, a watchdog, or a periodic sweep: something that will catch the bug class if it appears again and surface it before it silently breaks for a week.
5. Patch the surface symptom that JD originally noticed. Often redundant after step 3. That’s fine; it’s by design.

This order is non-negotiable. Steps 1–4 before step 5.

Real examples from the changelog

The PATH-in-cron bug

Symptom: Domain heartbeats and the CRM sync stopped running correctly. Nothing in the logs.

Surface patch: Restart the heartbeat scripts manually.

Root cause (actual): The cron environment doesn’t inherit the interactive PATH. /opt/homebrew/bin/python3.12 isn’t in the cron PATH, so python3.12 resolves to the wrong interpreter or fails entirely. Two crons were found running under Python 3.14 (a non-standard install) because of this. The heartbeats worked fine in an interactive terminal and failed silently in cron.
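One way to see this class of failure before it bites is to resolve the interpreter under both PATH values and compare. The sketch below is illustrative only: `CRON_PATH` is an assumed minimal cron default (check your own system), and the helper names `resolve` and `compare_resolution` are hypothetical.

```python
import shutil
from typing import Dict, Optional

# Assumption: a typical minimal cron PATH; verify the real value on your system.
CRON_PATH = "/usr/bin:/bin"
# The PATH used by the standard prelude described in this section.
PRELUDE_PATH = "/opt/homebrew/bin:/usr/local/bin:/usr/bin:/bin"


def resolve(command: str, path: str) -> Optional[str]:
    """Return the executable that `command` resolves to under `path`, or None."""
    return shutil.which(command, path=path)


def compare_resolution(command: str, paths: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Resolve one command under several named PATH values."""
    return {name: resolve(command, path) for name, path in paths.items()}


if __name__ == "__main__":
    result = compare_resolution(
        "python3.12", {"cron": CRON_PATH, "prelude": PRELUDE_PATH}
    )
    for name, found in result.items():
        print(f"{name:8} -> {found or 'not found'}")
```

If the two rows differ, or the cron row says "not found", the job will behave differently under cron than in an interactive terminal.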

Process fix: Every cron that invokes Python gets a standard prelude:

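# Run from the project root so relative paths resolve.
cd ~/agent-system
# cron does not inherit the interactive PATH, so set it explicitly.
export PATH="/opt/homebrew/bin:/usr/local/bin:/usr/bin:/bin"
export PYTHONPATH="$HOME/agent-system"
# Load environment settings if the file exists; errors are silenced.
source ~/agent-system/.env 2>/dev/null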

Backfill: One audit pass across all active cron jobs; the prelude was added to every agents.* cron that was missing it.

Regression guard: domain-integrity-check.sh (daily at 09:17) now flags any agents.* cron line missing the cd/PYTHONPATH prelude.
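The real check lives in domain-integrity-check.sh, whose contents are not shown here. As a rough sketch of the idea, assuming the prelude is inlined in each cron line, a guard could look like the following. The regex, the marker strings, and the function name `missing_prelude` are all assumptions for illustration.

```python
import re
from typing import List

# Assumption: jobs of interest reference something named "agents.<name>".
AGENT_JOB = re.compile(r"\bagents\.\w+")
# Assumption: the prelude is inlined, so both markers must appear on the line.
REQUIRED_MARKERS = ("cd ~/agent-system", "PYTHONPATH=")


def missing_prelude(crontab_text: str) -> List[str]:
    """Return the agents.* cron lines that lack any required prelude marker."""
    flagged = []
    for line in crontab_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        if not AGENT_JOB.search(stripped):
            continue  # not an agents.* job
        if not all(marker in stripped for marker in REQUIRED_MARKERS):
            flagged.append(stripped)
    return flagged
```

A daily job could feed the output of `crontab -l` into `missing_prelude` and alert whenever the returned list is non-empty.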

Symptom patch: Manually restarted the affected heartbeats — now redundant, since the process fix covered them.

The WORKPLAN “status” drift bug

Symptom: Task T0.1 showed as “open” in the WORKPLAN but the CHANGELOG showed it as shipped.

Surface patch: Edit T0.1 manually.

Root cause (actual): clawd-log.sh only wrote to the CHANGELOG. It did not update the WORKPLAN Status: marker. The pre-injection that feeds JD’s morning briefing was also capped at 10 bullets, missing anything after the 10th entry.

Process fix: Patched clawd-log.sh to accept --task ID and write the status marker; bumped the pre-injection cap to 30.

Backfill: One automated sweep of existing CHANGELOG entries → WORKPLAN markers.
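A backfill sweep of this kind might be sketched as below. The actual WORKPLAN and CHANGELOG formats are not shown in this section, so everything about the layout is an assumption: task IDs shaped like T0.1, a literal "Status: open" marker on the task's line, and any CHANGELOG mention of an ID counting as shipped. The function names are hypothetical.

```python
import re
from typing import Set

# Assumption: task IDs look like T0.1 and appear verbatim in both files.
TASK_ID = re.compile(r"\bT\d+\.\d+\b")


def shipped_ids(changelog_text: str) -> Set[str]:
    """Collect every task ID mentioned anywhere in the CHANGELOG."""
    return set(TASK_ID.findall(changelog_text))


def backfill_status(workplan_text: str, shipped: Set[str]) -> str:
    """Flip 'Status: open' to 'Status: shipped' on lines whose task ID shipped."""
    out = []
    for line in workplan_text.splitlines():
        ids = TASK_ID.findall(line)
        # Assumption: the first ID on a line is the task that line describes.
        if ids and ids[0] in shipped:
            line = line.replace("Status: open", "Status: shipped")
        out.append(line)
    return "\n".join(out)
```

The point of the sketch is the shape of the sweep: one pass, driven by data, touching every existing instance rather than only the task that was noticed.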

Regression guard: A nightly reconcile script that fails if WORKPLAN items are marked open but their IDs appear in the last 30 days of CHANGELOG.
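The reconcile logic can be sketched in a few lines. This is not the actual script; it assumes each CHANGELOG entry starts with an ISO date (YYYY-MM-DD), that task IDs look like T0.1, and that open items carry a literal "Status: open" marker. `recent_ids` and `drifted` are hypothetical names.

```python
import re
from datetime import date, timedelta
from typing import List, Set

# Assumption: task IDs look like T0.1.
TASK_ID = re.compile(r"\bT\d+\.\d+\b")
# Assumption: each CHANGELOG entry starts with an ISO date (YYYY-MM-DD).
ENTRY_DATE = re.compile(r"^(\d{4}-\d{2}-\d{2})\b")


def recent_ids(changelog_text: str, today: date, window_days: int = 30) -> Set[str]:
    """Task IDs mentioned in CHANGELOG entries dated within the window."""
    cutoff = today - timedelta(days=window_days)
    found: Set[str] = set()
    for line in changelog_text.splitlines():
        match = ENTRY_DATE.match(line)
        if match and date.fromisoformat(match.group(1)) >= cutoff:
            found.update(TASK_ID.findall(line))
    return found


def drifted(workplan_text: str, changelog_text: str, today: date) -> List[str]:
    """IDs still marked open in the WORKPLAN despite a recent CHANGELOG entry."""
    recent = recent_ids(changelog_text, today)
    open_ids = [
        task_id
        for line in workplan_text.splitlines()
        if "Status: open" in line
        for task_id in TASK_ID.findall(line)[:1]  # first ID on the line only
    ]
    return sorted(task_id for task_id in open_ids if task_id in recent)
```

A nightly wrapper would read both files, call `drifted`, and exit non-zero when the list is non-empty so the failure is visible instead of silent.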

Symptom patch: T0.1 updated — covered by backfill.

The crontab wipe

Symptom: All 148 cron jobs vanished.

Root cause (actual): The crontab was being edited with a crontab -l | python -c <inline> | crontab - pipeline, and the inline Python had a SyntaxError. Python exited with an error and wrote nothing to stdout. crontab - given empty stdin replaces all jobs with nothing.

Process fix: “Never pipe inline expressions to crontab -.” Write to a temp file, validate, then install. This is now a hard rule, not a guideline.
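The "temp file, validate, install" rule could be sketched like this. It is a minimal illustration, not the system's actual tooling: the function names and the `max_removed` threshold are assumptions, and the validation shown is only a guard against accidental wipes.

```python
import subprocess
import tempfile
from typing import Callable, List


def job_lines(crontab_text: str) -> List[str]:
    """Non-blank, non-comment lines, i.e. the lines that define jobs or env vars."""
    return [
        line for line in crontab_text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    ]


def validate(old_text: str, new_text: str, max_removed: int = 5) -> None:
    """Raise ValueError if the edited crontab looks like an accidental wipe."""
    old_count, new_count = len(job_lines(old_text)), len(job_lines(new_text))
    if old_count > 0 and new_count == 0:
        raise ValueError("edited crontab is empty; refusing to install")
    if old_count - new_count > max_removed:  # hypothetical threshold
        raise ValueError(f"edit removes {old_count - new_count} lines; refusing")


def safe_edit(transform: Callable[[str], str]) -> None:
    """Read the crontab, transform it, validate, then install from a temp file."""
    old_text = subprocess.run(
        ["crontab", "-l"], capture_output=True, text=True, check=True
    ).stdout
    new_text = transform(old_text)  # an exception here aborts before any install
    validate(old_text, new_text)
    # The temp file is left on disk so the installed version can be inspected.
    with tempfile.NamedTemporaryFile("w", suffix=".cron", delete=False) as tmp:
        tmp.write(new_text)
    subprocess.run(["crontab", tmp.name], check=True)
```

The key difference from the pipe is that a failing transform raises before `crontab` is ever invoked, so an error can no longer be mistaken for an empty but valid result.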

Backfill: Restored from the most recent backup, then re-applied all known deltas.

Regression guard: A daily crontab auto-snapshot cron (added the same day) bounds future loss to under 24 hours.
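A daily snapshot along these lines might look like the sketch below. The snapshot directory, the file naming scheme, and the function names are assumptions. The one design choice worth noting is that an empty crontab is never written, so a wiped crontab cannot replace a good snapshot taken earlier the same day.

```python
import subprocess
from datetime import date
from pathlib import Path
from typing import Optional


def write_snapshot(crontab_text: str, snapshot_dir: Path, today: date) -> Optional[Path]:
    """Save one dated snapshot; skip empty crontabs so a wipe is never recorded as good."""
    if not crontab_text.strip():
        return None
    snapshot_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: one file per day.
    target = snapshot_dir / f"crontab-{today.isoformat()}.txt"
    target.write_text(crontab_text)
    return target


def snapshot_now(snapshot_dir: Path) -> Optional[Path]:
    """Capture `crontab -l` and store it under today's date."""
    result = subprocess.run(["crontab", "-l"], capture_output=True, text=True)
    if result.returncode != 0:
        return None  # no crontab installed, or crontab -l failed
    return write_snapshot(result.stdout, snapshot_dir, date.today())
```

Scheduled once a day, `snapshot_now` keeps the worst-case loss to the edits made since the last run.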

Symptom patch: All 148 jobs restored.

When the bandaid is explicitly chosen

Sometimes the right decision is: “bandaid this one, root-cause fix later.” That’s a valid call when the bug is blocking something more important and the root-cause fix is a 30-minute build you can’t spare right now.

The rule when this happens: open a loop. The surface symptom gets the bandaid; the root-cause fix gets an open-loop entry with an owner and a timeline. If the loop isn’t opened, the bandaid is a permanent fixture — and in six months you’ll forget why that manual workaround exists.

Push back on the bandaid when you can. The framing: “bandaid is 5 minutes, root-cause fix is 30 minutes, bandaid will recur in a different form next month. Recommend the 30-minute fix.” Then do it, unless JD explicitly says to defer.

What this looks like in an agent prompt

When spawning a build agent to fix a bug, include the root-cause-first doctrine explicitly:

Fix the bug. Follow the root-cause-first process:
1. Diagnose why the system allows this class of bug.
2. Fix the process that generates it.
3. Backfill existing instances.
4. Add a regression guard.
5. Then patch the surface symptom (often redundant after step 3).

Do NOT skip straight to step 5. If you can only do the surface patch in this session, open a loop for the root-cause fix and say so in your report.

Without this instruction, most agents default to step 5. The doctrine has to be in the prompt.

Next: Ship-to-prod workflow. The PR → CI → merge → deploy → smoke flow, and why JD reviews in prod, not in pull requests.