The One-Shot Methodology
We ship working applications in a single pass — idea in the morning, deployed and verified by evening — nearly every day. People assume it’s the model. It isn’t. The same model, prompted ad hoc, produces the same coin-flip quality everyone complains about.
What changed our hit rate was process, enforced by machinery the agent cannot opt out of. This page is the audit of our own system: the pipeline every app goes through, the guardrails that are machine-enforced versus merely written down, and the ranked list of mechanisms that actually explain the results.
The core insight: instructions drift, infrastructure doesn’t
Every behavioral rule you write into a context file will eventually be ignored — not because the model is bad, but because context fills with task work and the rule loses salience. We learned this the hard way, repeatedly. A protocol that mattered to us failed for weeks despite being documented in ALL CAPS with examples, until we moved it into a Stop hook that fires automatically at the end of every turn. It has not failed since.
That pattern generalizes into our first law:
Anything that must happen every time gets a hook, a gate, a cron, or a pre-commit script. Prose rules are for judgment calls only.
Everything below follows from it.
The pipeline: how an app actually gets built
Every application moves through seven steps. Each one is implemented by a real script, agent, or hook — not by hoping the model remembers.
1. Scaffold — projects exist as files, not ideas
One command creates the project: a folder with README.md, PRD.md, WORKPLAN.md, CHANGELOG.md, and LINKS.md, plus an entry in a machine-readable project registry. Nothing starts as “a conversation we had.”
Failure prevented: lost context across sessions, re-invented requirements, no decision record.
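As a rough illustration of the scaffold step, the sketch below creates the five files named above and records the project in a registry. The function name `scaffold_project`, the JSON registry layout, and the `status` field are assumptions made for this example; the text only says the registry is machine-readable.

```python
import json
from pathlib import Path

# File names come from the text above; everything else is illustrative.
PROJECT_FILES = ["README.md", "PRD.md", "WORKPLAN.md", "CHANGELOG.md", "LINKS.md"]


def scaffold_project(root: Path, name: str, registry_path: Path) -> Path:
    """Create the project folder, its five files, and a registry entry."""
    project_dir = root / name
    # Refuse to clobber an existing project.
    project_dir.mkdir(parents=True, exist_ok=False)

    for filename in PROJECT_FILES:
        title = Path(filename).stem
        (project_dir / filename).write_text(f"# {name}: {title}\n", encoding="utf-8")

    # Hypothetical registry format: one JSON object keyed by project name.
    registry = {}
    if registry_path.exists():
        registry = json.loads(registry_path.read_text(encoding="utf-8"))
    registry[name] = {"path": str(project_dir), "status": "active"}
    registry_path.write_text(json.dumps(registry, indent=2), encoding="utf-8")
    return project_dir
```

Calling `scaffold_project(Path("projects"), "demo", Path("projects/registry.json"))` would leave a `projects/demo` folder and a matching registry entry. Failing on an existing folder is a deliberate choice in this sketch, so that a second run cannot silently overwrite a PRD.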
2. Lock intent — written requirements before code
Significant changes must trace to written intent: a fresh PRD, user stories, or at minimum a commit message that states the story. A deterministic PRD gate (a script, not an LLM — it costs $0 to run) blocks significant changes that have no written intent anywhere.
The PRD’s most important section is its acceptance tests — concrete, user-visible behaviors derived from real usage, each one falsifiable. Not “chat should work well” but “clicking any live agent in any rail section opens its pane.”
Failure prevented: “I’ll fix this quickly” commits where nobody — including the agent — can later say what “this” was.
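A deterministic gate of this kind can be very small. The sketch below is one possible shape, not the actual script: the significance thresholds, the `Change` fields, and the "As a ..., I want ..." commit-message convention are all assumptions chosen for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Change:
    files_changed: int
    lines_changed: int
    commit_message: str
    has_prd_update: bool = False
    has_user_story: bool = False


# Assumed thresholds; tune them to your own definition of "significant".
MAX_TRIVIAL_FILES = 2
MAX_TRIVIAL_LINES = 40

# Assumed convention: a commit message states its story as "As a ..., I want ...".
STORY_PATTERN = re.compile(r"\bas an? .+?,? i want\b", re.IGNORECASE)


def is_significant(change: Change) -> bool:
    """Rule-based significance check: no model call involved."""
    return (
        change.files_changed > MAX_TRIVIAL_FILES
        or change.lines_changed > MAX_TRIVIAL_LINES
    )


def has_written_intent(change: Change) -> bool:
    """Intent counts if it lives in a PRD, a user story, or the commit message."""
    return (
        change.has_prd_update
        or change.has_user_story
        or bool(STORY_PATTERN.search(change.commit_message))
    )


def prd_gate(change: Change) -> tuple[bool, str]:
    """Return (allowed, reason) for a proposed change."""
    if not is_significant(change):
        return True, "trivial change, no intent required"
    if has_written_intent(change):
        return True, "significant change with written intent"
    return False, "significant change with no written intent"
```

Because every branch is a plain comparison or a regex match, the same input always yields the same verdict, which is the property that makes the gate cheap to run on every commit.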
3. Brief — inject ground truth at session start
A SessionStart hook assembles a briefing for the build session: the diff so far, the PRD and milestones, prior related changes, and a map of the system’s contracts and shared boundaries. The agent reads ground truth before interpreting the request.
Failure prevented: the agent confidently building on a stale mental model of the codebase.
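The assembly itself can be plain string work once the four inputs are collected. The sketch below assumes a hypothetical `build_briefing` helper, a markdown layout, and a per-section character cap; how the hook gathers the diff or the contract map is outside its scope.

```python
def build_briefing(
    diff: str,
    prd: str,
    prior_changes: list[str],
    contracts: str,
    max_chars: int = 4000,  # assumed per-section cap so one input cannot crowd out the rest
) -> str:
    """Assemble the four ground-truth inputs into one briefing, in a fixed order."""
    sections = [
        ("Diff so far", diff),
        ("PRD and milestones", prd),
        ("Prior related changes", "\n".join(f"- {c}" for c in prior_changes)),
        ("Contracts and shared boundaries", contracts),
    ]
    parts = []
    for title, body in sections:
        # An empty section is stated explicitly rather than silently dropped.
        body = body.strip()[:max_chars] or "(none found)"
        parts.append(f"## {title}\n{body}")
    return "\n\n".join(parts)
```

Marking an empty section as "(none found)" keeps the briefing honest: the agent can tell the difference between "no prior changes exist" and "nobody looked".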
4. Build — under constraints that make parallelism safe
- Worktree isolation is mandatory for any concurrent agent that touches git state. Each agent gets its own git worktree; a pre-commit hook on the shared checkout refuses commits from it outright, and a watchdog cron resets drift within minutes. We adopted this after parallel agents silently switched branches under each other three separate times, each costing 15+ minutes of reflog archaeology.
- Reuse before rebuild. Before spawning a generic agent, the orchestrator must check the registry of existing specialists (QA runner, researcher, domain experts — each with a hardened CLI, shared output formats, and accumulated regression baselines). Re-implementing an existing specialist is a violation, not a style choice.
- Model routing by tier. Heavyweight models for architectural judgment, mid-tier for execution, cheap models for scanning. The expensive review only fires when a deterministic significance classifier says the change warrants it.
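For the worktree rule, one way a pre-commit check could tell the shared checkout from a linked worktree is to compare two paths git already reports: `git rev-parse --git-dir` and `git rev-parse --git-common-dir`. They resolve to the same directory in the main checkout and differ in a linked worktree. The sketch below is an assumption about how such a hook might look, not the hook described above; the function names are hypothetical.

```python
import subprocess
import sys
from pathlib import Path


def is_shared_checkout(git_dir: str, common_dir: str) -> bool:
    """True when both paths resolve to the same directory.

    In the main checkout the two are identical; in a linked worktree
    git_dir points at <common_dir>/worktrees/<name>.
    """
    return Path(git_dir).resolve() == Path(common_dir).resolve()


def git_path(flag: str) -> str:
    """Ask git for one of its directory paths, e.g. --git-dir."""
    result = subprocess.run(
        ["git", "rev-parse", flag], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


def pre_commit_check() -> int:
    """Return a non-zero exit code to refuse a commit from the shared checkout."""
    if is_shared_checkout(git_path("--git-dir"), git_path("--git-common-dir")):
        print("Refusing commit from the shared checkout; use a worktree.", file=sys.stderr)
        return 1
    return 0
```

A hook script would end with `sys.exit(pre_commit_check())`. Keeping the comparison in a pure function means the rule can be unit-tested without a repository on disk.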
5. Verify — intent verification, then spec-driven QA
Two independent checks, neither performed by the builder:
Intent verification. A read-only agent with no stake in the outcome re-derives what was asked (from the PRD and commit messages), reads the diff, and answers one question: does the delivery match the ask? A mismatch blocks. This catches the most insidious failure in agentic development — work that is technically fine but isn’t what anyone asked for.
Spec-driven QA. A QA agent parses the PRD into test specs and runs them with Playwright against the deployed app — desktop and mobile viewports, accessibility, visual regression against baselines, screenshots as evidence. The key design choice: QA tests what the PRD promised, never what the builder claimed. “I fixed it” is not evidence; a green run against the acceptance tests is.
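Turning a PRD into test specs starts with pulling the acceptance tests out of the document. The sketch below assumes a PRD written in markdown with a heading named "Acceptance tests" followed by a bullet list; that layout and the `parse_acceptance_tests` name are assumptions for illustration, and the browser run itself is not shown.

```python
def parse_acceptance_tests(prd_text: str) -> list[str]:
    """Collect the bullet items under an 'Acceptance tests' heading."""
    specs = []
    in_section = False
    for line in prd_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            # Any heading either opens the acceptance section or closes it.
            heading = stripped.lstrip("#").strip().lower()
            in_section = heading == "acceptance tests"
            continue
        if in_section and stripped.startswith("- "):
            specs.append(stripped[2:].strip())
    return specs
```

Each returned string is a promise the PRD made, so a QA runner that iterates over this list is testing the spec rather than the builder's summary of the work.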
Both gates fail open on infrastructure errors (a crashed verifier never wedges a build), and the review gate caps consecutive blocks so that a builder/reviewer disagreement escalates to the human instead of looping forever.
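The two safety behaviors, failing open and capping consecutive blocks, fit in a few lines. In the sketch below the cap of 3 is an assumed value (the text does not state the number), and `run_gate` and `GateState` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable

MAX_CONSECUTIVE_BLOCKS = 3  # assumed cap; the real number is not given in the text


@dataclass
class GateState:
    consecutive_blocks: int = 0


def run_gate(verify: Callable[[], bool], state: GateState) -> str:
    """Return "pass", "block", or "escalate" for one verification attempt."""
    try:
        matches = verify()
    except Exception:
        # Infrastructure failure (crash, timeout, unparsable output): fail open.
        return "pass"

    if matches:
        state.consecutive_blocks = 0
        return "pass"

    state.consecutive_blocks += 1
    if state.consecutive_blocks >= MAX_CONSECUTIVE_BLOCKS:
        # Builder and verifier keep disagreeing: hand the decision to the human.
        state.consecutive_blocks = 0
        return "escalate"
    return "block"
```

Note the asymmetry: a verifier that runs and says "no" blocks, while a verifier that cannot run at all passes. That trades some strictness for the guarantee that broken tooling never stalls a build.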
6. Ship to prod — the human reviews products, not diffs
The orchestrating agent opens the PR, waits for CI, merges, deploys, smoke-tests the production URL itself, restarts anything that consumes the new code, and only then reports: “live at X, here’s what changed, go click it.” The human evaluates by using the product. Diffs are reviewed by a code-review agent, not by the human.
Failure prevented: the dead zone where finished work sits in PR limbo, and the human’s attention is spent on diff-reading instead of product judgment.
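The ordering constraint in this step ("only then reports") can be expressed as a runner that stops at the first failing step and never claims the app is live early. This is a minimal sketch with hypothetical names; each step is a placeholder callable standing in for the real PR, CI, deploy, and smoke-test actions.

```python
from typing import Callable

# A step is a label plus a callable that returns True on success.
Step = tuple[str, Callable[[], bool]]


def ship(steps: list[Step], url: str) -> str:
    """Run steps in order; report "live" only if every one succeeded."""
    for name, run in steps:
        if not run():
            # Later steps are never attempted once one fails.
            return f"stopped at '{name}'; nothing reported as live"
    return f"live at {url}"
```

In use, the list would follow the order from the text: open PR, wait for CI, merge, deploy, smoke-test the production URL, restart consumers.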
7. Log — every ship leaves a record
One command appends a timestamped entry to the project’s CHANGELOG.md and flips the matching WORKPLAN.md checkbox. The changelog — not chat history — is the source of truth for “what did we build.” Chat history dies at session rotation; files don’t.
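A logging command along these lines might look like the sketch below. It assumes the workplan uses task-list items of the form `- [ ] <item>` and that an item is matched by exact text; the `log_ship` name and the entry format are illustrative, not the actual command.

```python
from datetime import datetime, timezone
from pathlib import Path


def log_ship(project_dir: Path, item: str, summary: str) -> bool:
    """Append a changelog entry and tick the matching workplan checkbox.

    Returns True if a checkbox was flipped, False if no item matched.
    """
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    with (project_dir / "CHANGELOG.md").open("a", encoding="utf-8") as f:
        f.write(f"- {stamp}: {item}. {summary}\n")

    workplan = project_dir / "WORKPLAN.md"
    lines = workplan.read_text(encoding="utf-8").splitlines()
    flipped = False
    for i, line in enumerate(lines):
        # Assumed format: an unchecked task-list item whose text equals `item`.
        if line.strip() == f"- [ ] {item}":
            lines[i] = line.replace("- [ ]", "- [x]", 1)
            flipped = True
            break
    workplan.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return flipped
```

Returning whether a checkbox was flipped lets the caller notice a ship that matches nothing in the workplan, which is itself a signal that the plan and the work have drifted apart.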
Machine-enforced vs. prose: know which is which
| Mechanism | Enforcement | Kind |
|---|---|---|
| PRD gate (intent required for significant changes) | Script on commit | Machine |
| Architect briefing at session start | SessionStart hook | Machine |
| Intent verification + architectural sign-off | Stop hook | Machine |
| Worktree isolation for parallel agents | Pre-commit hook + watchdog cron | Machine |
| Missed-cron replay after sleep/reboot | Anacron-style watchdog | Machine |
| Session digest for context handoff | SessionEnd hook | Machine |
| Root-cause-first bug discipline | Constitution (prose) | Judgment |
| Reuse-before-rebuild (specialist inventory) | Constitution + registry script | Hybrid |
| Ship-to-prod workflow | Constitution (prose) | Judgment |
| Push-back-when-wrong agent personality | Constitution (prose) | Judgment |
The split is deliberate. Mechanical invariants (“never commit from the shared checkout”) are enforced mechanically. Judgment calls (“is this a bandaid or a root-cause fix?”) stay prose, because a script can’t make them — but the prose includes worked examples and the documented incidents that created each rule, which is what makes it stick.
Root-cause-first: why bugs don’t recur
Every bug gets this treatment, in this order, before the symptom is patched:
1. Diagnose the root cause — 5-Whys until you hit a process gap, not a code line.
2. Fix the process — the script, the hook, the schema, the convention that generated this class of bug.
3. Backfill existing instances of the bug class in one automated pass.
4. Add a regression guard so the class can’t return silently.
5. Then patch the original symptom — which is often already fixed by step 3, by design.
Example: a scheduled job missed its run after a reboot. The bandaid is “re-run it once.” The root-cause fix was an anacron-style catchup watchdog that detects and replays any missed job — which then caught a real miss two days later that nobody would have noticed.
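The detection half of such a catchup watchdog reduces to comparing each job's last successful run against its expected interval. The sketch below assumes last-run timestamps are already recorded somewhere and passed in as a dictionary; `find_missed_jobs` and the data shapes are hypothetical.

```python
from datetime import datetime, timedelta


def find_missed_jobs(
    last_run: dict[str, datetime],
    intervals: dict[str, timedelta],
    now: datetime,
) -> list[str]:
    """Return the jobs whose last run is older than their interval.

    A job with no recorded run at all is treated as missed.
    """
    missed = []
    for job, interval in intervals.items():
        ran_at = last_run.get(job)
        if ran_at is None or now - ran_at > interval:
            missed.append(job)
    return sorted(missed)
```

A watchdog would call this on wake or boot and replay whatever comes back. Because it checks elapsed time rather than the schedule, it does not care why a run was skipped, whether sleep, reboot, or a crashed scheduler.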
This is why the system compounds. Most agent setups fight the same five fires forever; ours retires a fire class every time one burns.
The ranked mechanisms
If you can only adopt a few, adopt them in this order:
1. No unverified work reaches the human — spec-driven QA against the PRD’s acceptance tests, before ship, every time.
2. Independent intent verification — a no-stake agent checks the diff against the ask.
3. PRD gate — significant change with no written intent = blocked, deterministically.
4. File-based state — every decision, plan, and ship record lives in a file; sessions are disposable.
5. Worktree isolation + pre-commit enforcement — parallel agents physically cannot collide.
6. Root-cause-first discipline — bug classes get retired, not revisited.
7. Specialist registry with reuse mandate — tooling and regression baselines compound instead of resetting.
8. Changelog + workplan logging — progress is visible and auditable without reading transcripts.
Why it works
None of these mechanisms make the model smarter. They do something better: they remove the failure modes that don’t depend on intelligence. Lost context, drifted intent, unverified claims, parallel collisions, recurring bugs, invisible work — each one is structural, and each one is eliminated structurally.
What’s left is the model doing what it’s genuinely good at — writing code against a clear, written spec, with ground truth injected and verification waiting at the exit. That’s the whole trick. One-shot quality isn’t a prompt. It’s an exit gate the work has to pass through, and a paper trail that survives the session that produced it.