Cron Catchup & Resilience

Tier 2 · Building 6 min read

Before this, read:

Crons as a heartbeat — the heartbeat model whose resilience this article covers

Standard cron has no memory. If the machine is off at 7am when the morning briefing is scheduled, that run doesn’t happen and cron doesn’t compensate — the next run is tomorrow at 7am. For a system where 155+ jobs run daily, a reboot during the night means dozens of missed runs. Most are fine to skip. Some are not.

This article covers the two tools that make the cron heartbeat resilient: the catchup watchdog that replays critical missed runs, and the stale-cron alerter that watches for silent failures.

The catchup watchdog

~/agent-system/scripts/cron-catchup.sh, runs every 47 minutes.

The catchup script is an anacron-style solution: instead of trying to fix cron’s scheduling semantics, it runs frequently and checks whether certain jobs have run recently. If a job in the “safe-to-catchup” manifest hasn’t run within its expected window (because of a reboot or failure), the watchdog runs it with a CATCHUP=1 environment variable and logs the replay.

The CATCHUP=1 flag is important. Disruptive jobs — morning briefings, scripture checks, weekly reviews — honor this flag: when CATCHUP=1, they generate the artifact but skip the Telegram blast. The user gets the output; they don’t get the notification at an unexpected time.

Three scripts honor CATCHUP=1 (as of 2026-04-16, the initial ship):

morning-briefing.sh
scripture-check.sh
weekly-review.sh

The watchdog caught a real missed run on April 15, 2026 — a reboot the night before had dropped the morning briefing. The catchup ran it silently at 9am, produced the artifact, and skipped the Telegram blast.

What goes in the catchup manifest

Not every job belongs in the manifest. The decision is: if this job missed a run, does the missed output matter?

Yes (add to manifest): domain heartbeats, morning briefing, daily database backup, grade alerts, state syncs
No (skip): jobs that just send noise if run twice, jobs whose output is time-sensitive enough that a delayed run is worse than no run, jobs that have other alerting if they fail

The manifest is opt-in. Adding a script to the manifest means you’re asserting that a delayed run of that script produces a useful artifact. Be conservative.

Stale-cron alerting

~/agent-system/scripts/stale-cron-alerter.sh, runs every 2 hours.

The stale-cron alerter watches 7 critical cron jobs and checks their log timestamps. If a job’s last successful run is more than 2× its expected interval, the alerter pings JD on Telegram.

The 7 watched jobs (as of the initial 2026-04-16 ship):

Morning briefing (expected: daily)
Canvas grade check (expected: daily)
Gmail poll (expected: hourly)
Domain heartbeats (expected: hourly)
Grade alerts (expected: daily)
State sync (expected: daily)
MCP health check (expected: every 2 hours)

The alerter uses a 12-hour suppression window so it doesn’t spam the same failure six times a day. One alert per issue, per half-day.

The alert message includes:

Which job is stale
How long since its last successful run
The last few lines of its log (for immediate diagnosis)

Why alerting matters more than you’d expect

A cron failure that’s not surfaced is invisible. The job logs show errors; no one reads the logs; the system continues thinking things are working when they’re not. Before the stale-cron alerter, the system had a MCP health check that was 131 hours stale — nearly 6 days — because the check had silently failed and no one noticed.

The alerter is not smart. It doesn’t diagnose the failure or fix it. It does one thing: makes the failure visible before it’s been silently broken for a week.

The crontab backup

A separate failsafe: the crontab is automatically snapshotted daily at 6am:

0 6 * * * crontab -l > ~/agent-system/state/crontab-backups/$(date +%Y-%m-%d).txt

This exists because of the June 2026 incident where an inline Python SyntaxError wiped all 148 cron jobs by feeding empty stdin to crontab -. Recovery required restoring the June 6 backup and re-applying 6 days of known changes. With a daily snapshot, worst-case loss is under 24 hours.

The lesson applied as a process rule: never pipe inline expressions to crontab -. Write to a temp file, validate the file, then install it.

The path prelude problem

A large fraction of cron failures are not logic errors — they’re environment failures. The minimal PATH in a cron environment doesn’t include Homebrew’s /opt/homebrew/bin, so python3.12 isn’t found. PYTHONPATH isn’t set, so agents.shared.notify can’t be imported.

The standard prelude for any cron that invokes Python:

#!/bin/bash
cd ~/agent-system
export PATH="/opt/homebrew/bin:/usr/local/bin:/usr/bin:/bin"
export PYTHONPATH="$HOME/agent-system"
source ~/agent-system/.env 2>/dev/null

Any cron missing this prelude will work fine in an interactive terminal and fail silently in cron. An audit in May 2026 found that the PATH-in-cron regression had stopped several heartbeats and the CRM sync — they all failed with command not found: python3.12 in the logs, which no one was reading.

The domain-integrity-check script (~/agent-system/scripts/domain-integrity-check.sh, runs daily at 09:17) flags any agents.* cron line missing the cd/PYTHONPATH prelude as a part of its check.

Next: How to turn a Claude Code session into an always-on Telegram bot — the inbound router, command palette, and reply protocol. Telegram as a control plane.