The One-Shot Methodology

We ship working applications in a single pass — idea in the morning, deployed and verified by evening — nearly every day. People assume it’s the model. It isn’t. The same model, prompted ad-hoc, produces the same coin-flip quality everyone complains about.

What changed our hit rate was process, enforced by machinery the agent cannot opt out of. This page is the audit of our own system: the pipeline every app goes through, the guardrails that are machine-enforced versus merely written down, and the ranked list of mechanisms that actually explain the results.

The core insight: instructions drift, infrastructure doesn’t

Every behavioral rule you write into a context file will eventually be ignored — not because the model is bad, but because context fills with task work and the rule loses salience. We learned this the hard way, repeatedly. A protocol that mattered to us failed for weeks despite being documented in ALL CAPS with examples, until we moved it into a Stop hook that fires automatically at the end of every turn. It has not failed since.

That pattern generalizes into our first law:

Anything that must happen every time gets a hook, a gate, a cron, or a pre-commit script. Prose rules are for judgment calls only.

Everything below follows from it.

The pipeline: how an app actually gets built

Every application moves through seven steps. Each one is implemented by a real script, agent, or hook — not by hoping the model remembers.

1. Scaffold — projects exist as files, not ideas

One command creates the project: a folder with README.md, PRD.md, WORKPLAN.md, CHANGELOG.md, and LINKS.md, plus an entry in a machine-readable project registry. Nothing starts as “a conversation we had.”

Failure prevented: lost context across sessions, re-invented requirements, no decision record.
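A minimal sketch of what such a scaffold command could look like, assuming the registry is a single JSON file stored next to the project folders. The function name `scaffold`, the `registry.json` filename, and the registry fields are hypothetical; only the five project files come from the description above.

```python
import json
from datetime import date
from pathlib import Path

# The five standard files named in the scaffold step.
PROJECT_FILES = ["README.md", "PRD.md", "WORKPLAN.md", "CHANGELOG.md", "LINKS.md"]


def scaffold(root: Path, name: str, registry_name: str = "registry.json") -> Path:
    """Create the project folder, its standard files, and a registry entry."""
    project = root / name
    # exist_ok=False: refuse to overwrite a project that already exists.
    project.mkdir(parents=True, exist_ok=False)
    for filename in PROJECT_FILES:
        title = Path(filename).stem
        (project / filename).write_text(f"# {name}: {title}\n", encoding="utf-8")

    # Hypothetical registry format: one JSON object keyed by project name.
    registry_path = root / registry_name
    if registry_path.exists():
        registry = json.loads(registry_path.read_text(encoding="utf-8"))
    else:
        registry = {}
    registry[name] = {"path": str(project), "created": date.today().isoformat()}
    registry_path.write_text(json.dumps(registry, indent=2), encoding="utf-8")
    return project
```

Calling `scaffold(Path("projects"), "agent-rail")` would create `projects/agent-rail/` with the five files and record it in `projects/registry.json`.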

2. Lock intent — written requirements before code

Significant changes must trace to written intent: a fresh PRD, user stories, or at minimum a commit message that states the story. A deterministic PRD gate (a script, not an LLM — it costs $0 to run) blocks significant changes that have no written intent anywhere.

The PRD’s most important section is its acceptance tests — concrete, user-visible behaviors derived from real usage, each one falsifiable. Not “chat should work well” but “clicking any live agent in any rail section opens its pane.”

Failure prevented: “I’ll fix this quickly” commits where nobody — including the agent — can later say what “this” was.
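One way a deterministic gate of this kind might be written is sketched here. The significance rule (non-Markdown files touched and more than 20 changed lines), the story pattern, and the `Change` record are all assumptions made for illustration, not the thresholds or signals the real gate uses.

```python
import re
from dataclasses import dataclass


@dataclass
class Change:
    files: list[str]        # paths touched by the change
    lines_changed: int      # added plus removed lines
    commit_message: str
    prd_text: str = ""      # contents of PRD.md, empty if there is none


# Hypothetical convention: a commit states its story as "Story: ..." or
# in the classic "As a ..., I want ..." form.
STORY_PATTERN = re.compile(r"\bstory:|\bas an? .+?, i want\b", re.IGNORECASE)


def is_significant(change: Change, max_trivial_lines: int = 20) -> bool:
    """Assumed rule: code (not docs) was touched and the change is not tiny."""
    touches_code = any(not path.endswith(".md") for path in change.files)
    return touches_code and change.lines_changed > max_trivial_lines


def has_written_intent(change: Change) -> bool:
    """Intent counts if the PRD has acceptance tests or the commit states a story."""
    if "acceptance tests" in change.prd_text.lower():
        return True
    return bool(STORY_PATTERN.search(change.commit_message))


def prd_gate(change: Change) -> tuple[bool, str]:
    """Return (allowed, reason). Pure string checks, so no model call is needed."""
    if not is_significant(change):
        return True, "not significant"
    if has_written_intent(change):
        return True, "written intent found"
    return False, "blocked: significant change with no written intent"
```

Because the gate is plain pattern matching, the same input always gives the same verdict, which is what lets it run on every commit at no cost.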

3. Brief — inject ground truth at session start

A SessionStart hook assembles a briefing for the build session: the diff so far, the PRD and milestones, prior related changes, and a map of the system’s contracts and shared boundaries. The agent reads ground truth before interpreting the request.

Failure prevented: the agent confidently building on a stale mental model of the codebase.
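The briefing assembly described in this step can be sketched roughly as follows. How the text reaches the agent depends on the harness, so this only builds the string. The `CONTRACTS.md` filename, the `main` base branch, and the 4000-character cap are hypothetical choices.

```python
import subprocess
from pathlib import Path


def git_diff_stat(repo: Path, base: str = "main") -> str:
    """Summarize the working tree's diff against a base branch (assumed 'main')."""
    result = subprocess.run(
        ["git", "-C", str(repo), "diff", "--stat", base],
        capture_output=True, text=True, check=False,
    )
    return result.stdout if result.returncode == 0 else "(diff unavailable)"


def read_or_placeholder(path: Path, limit: int = 4000) -> str:
    """Read a file, truncated to keep the briefing small; flag missing files."""
    if not path.exists():
        return f"(missing: {path.name})"
    return path.read_text(encoding="utf-8")[:limit]


def build_briefing(project: Path, diff_text: str) -> str:
    """Assemble the ground-truth sections in a fixed order."""
    sections = [
        ("Diff so far", diff_text),
        ("PRD", read_or_placeholder(project / "PRD.md")),
        ("Milestones", read_or_placeholder(project / "WORKPLAN.md")),
        ("Prior changes", read_or_placeholder(project / "CHANGELOG.md")),
        # Hypothetical file holding the map of contracts and shared boundaries.
        ("Contracts and shared boundaries", read_or_placeholder(project / "CONTRACTS.md")),
    ]
    return "\n\n".join(f"## {title}\n{body.strip()}" for title, body in sections)
```

A hook script would call `build_briefing(project, git_diff_stat(project))` and hand the result to the session. Missing files show up as explicit placeholders rather than silently vanishing from the briefing.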

4. Build — under constraints that make parallelism safe

Worktree isolation is mandatory for any concurrent agent that touches git state. Each agent gets its own git worktree; a pre-commit hook refuses outright any commit made from the shared checkout, and a watchdog cron resets drift within minutes. We adopted this after parallel agents silently switched branches under each other three separate times, each costing 15+ minutes of reflog archaeology.
Reuse before rebuild. Before spawning a generic agent, the orchestrator must check the registry of existing specialists (QA runner, researcher, domain experts — each with a hardened CLI, shared output formats, and accumulated regression baselines). Re-implementing an existing specialist is a violation, not a style choice.
Model routing by tier. Heavyweight models for architectural judgment, mid-tier for execution, cheap models for scanning. The expensive review only fires when a deterministic significance classifier says the change warrants it.
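The worktree rule in this step lends itself to a mechanical check. The sketch below is one possible pre-commit guard, not necessarily the hook we run: it relies on the fact that `git rev-parse --git-dir` and `git rev-parse --git-common-dir` resolve to the same directory in the main checkout and to different directories inside a linked worktree. The function names are hypothetical.

```python
import subprocess
import sys
from pathlib import Path


def rev_parse(flag: str, cwd: Path) -> Path:
    """Run `git rev-parse <flag>` and return the result as an absolute path."""
    out = subprocess.run(
        ["git", "rev-parse", flag],
        cwd=cwd, capture_output=True, text=True, check=True,
    ).stdout.strip()
    # Git may print a relative path such as ".git"; anchor it to cwd.
    return (cwd / out).resolve()


def is_shared_checkout(git_dir: Path, common_dir: Path) -> bool:
    """In the main checkout both paths match; in a linked worktree they differ."""
    return git_dir == common_dir


def main() -> int:
    """Exit status for a pre-commit hook: nonzero aborts the commit."""
    cwd = Path.cwd()
    git_dir = rev_parse("--git-dir", cwd)
    common_dir = rev_parse("--git-common-dir", cwd)
    if is_shared_checkout(git_dir, common_dir):
        print("commit refused: use your own git worktree, not the shared checkout",
              file=sys.stderr)
        return 1
    return 0
```

Installed as the pre-commit hook, the script would end with `sys.exit(main())`. A nonzero exit status is what makes git abort the commit.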

5. Verify — intent verification, then spec-driven QA

Two independent checks, neither performed by the builder:

Intent verification. A read-only agent with no stake in the outcome re-derives what was asked (from the PRD and commit messages), reads the diff, and answers one question: does the delivery match the ask? A mismatch blocks. This catches the most insidious failure in agentic development — work that is technically fine but isn’t what anyone asked for.

Spec-driven QA. A QA agent parses the PRD into test specs and runs them with Playwright against the deployed app — desktop and mobile viewports, accessibility, visual regression against baselines, screenshots as evidence. The key design choice: QA tests what the PRD promised, never what the builder claimed. “I fixed it” is not evidence; a green run against the acceptance tests is.
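To illustrate the first half of that QA step, here is a small sketch that turns a PRD's acceptance tests into test specs. It assumes a hypothetical PRD layout in which the tests are bullet items under a heading containing "Acceptance tests". The `AcceptanceSpec` type and the `AT-n` identifiers are invented for the example, and the browser run itself is left out.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class AcceptanceSpec:
    spec_id: str    # hypothetical identifier, e.g. "AT-1"
    behavior: str   # the user-visible behavior the PRD promised


def parse_acceptance_tests(prd_text: str) -> list[AcceptanceSpec]:
    """Collect bullet items that sit under an 'Acceptance tests' heading."""
    specs: list[AcceptanceSpec] = []
    in_section = False
    for line in prd_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            # Any heading opens or closes the acceptance-test section.
            in_section = "acceptance tests" in stripped.lower()
            continue
        match = re.match(r"[-*]\s+(.*\S)", stripped)
        if in_section and match:
            specs.append(AcceptanceSpec(f"AT-{len(specs) + 1}", match.group(1)))
    return specs
```

Each spec would then be handed to the browser runner. Because the specs are read from the PRD file, the builder's own account of the change never enters the test plan.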

Both gates fail open on infrastructure errors (a crashed verifier never wedges a build) and the review gate caps consecutive blocks so a builder/reviewer disagreement escalates to the human instead of looping forever.
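The two safety properties in that paragraph, failing open and capping consecutive blocks, can be sketched in a few lines. The cap of three and the names `GateState` and `run_gate` are assumptions. For brevity the sketch applies the cap to whichever gate it wraps, although only the review gate is described as having one.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GateState:
    consecutive_blocks: int = 0


def run_gate(check: Callable[[], bool], state: GateState, max_blocks: int = 3) -> str:
    """Run one gate check and return 'pass', 'block', or 'escalate'."""
    try:
        ok = check()
    except Exception as exc:
        # Fail open: a crashed verifier is an infrastructure problem,
        # not a verdict, so the build is not held up. The counter is untouched.
        print(f"gate error, failing open: {exc}")
        return "pass"
    if ok:
        state.consecutive_blocks = 0
        return "pass"
    state.consecutive_blocks += 1
    if state.consecutive_blocks >= max_blocks:
        # Builder and reviewer keep disagreeing: hand the decision to the human.
        return "escalate"
    return "block"
```

With the default cap, the third consecutive failed check returns `"escalate"` instead of `"block"`, and any passing check resets the counter.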

6. Ship to prod — the human reviews products, not diffs

The orchestrating agent opens the PR, waits for CI, merges, deploys, smoke-tests the production URL itself, restarts anything that consumes the new code, and only then reports: “live at X, here’s what changed, go click it.” The human evaluates by using the product. Diffs are reviewed by a code-review agent, not by the human.

Failure prevented: the dead zone where finished work sits in PR limbo, and the human’s attention is spent on diff-reading instead of product judgment.
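The ordering in this step matters more than any single command, so the sketch below models only that: steps run in sequence, and nothing is reported as live unless every step, including the production smoke test, succeeds. The `Step` alias, `ship`, and `smoke_test` are hypothetical names. The smoke test treats any 2xx response as success, which is an assumption about what "smoke-tests the production URL" means.

```python
import urllib.request
from typing import Callable

# A step is a label plus a zero-argument action that returns True on success.
Step = tuple[str, Callable[[], bool]]


def smoke_test(url: str, timeout: float = 10.0) -> bool:
    """Return True if the URL answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # Covers connection failures, timeouts, and HTTP error statuses.
        return False


def ship(steps: list[Step]) -> tuple[bool, str]:
    """Run steps in order; stop at the first failure."""
    for name, action in steps:
        if not action():
            return False, f"stopped at '{name}', nothing reported as live"
    return True, "all steps passed"
```

A caller would pass steps such as opening the PR, waiting for CI, merging, deploying, and `lambda: smoke_test(production_url)`, and would send the report to the human only when `ship` returns `True`.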

7. Log — every ship leaves a record

One command appends a timestamped entry to the project’s CHANGELOG.md and flips the matching WORKPLAN.md checkbox. The changelog — not chat history — is the source of truth for “what did we build.” Chat history dies at session rotation; files don’t.
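A sketch of such a logging command, assuming the workplan uses Markdown-style `- [ ]` checkboxes and that the matching item is found by substring. The entry format and the function name `log_ship` are hypothetical.

```python
from datetime import datetime, timezone
from pathlib import Path


def log_ship(project: Path, item: str, summary: str) -> bool:
    """Append a changelog entry and tick the first matching open checkbox.

    Returns True if a workplan checkbox was flipped.
    """
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    # Append-only: earlier entries are never rewritten.
    with (project / "CHANGELOG.md").open("a", encoding="utf-8") as changelog:
        changelog.write(f"- {stamp}: {item}: {summary}\n")

    workplan = project / "WORKPLAN.md"
    lines = workplan.read_text(encoding="utf-8").splitlines(keepends=True)
    flipped = False
    for index, line in enumerate(lines):
        if not flipped and line.lstrip().startswith("- [ ]") and item in line:
            lines[index] = line.replace("- [ ]", "- [x]", 1)
            flipped = True
    workplan.write_text("".join(lines), encoding="utf-8")
    return flipped
```

The return value lets the caller notice when a ship has no matching workplan item, which usually means the plan and the work have drifted apart.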

Machine-enforced vs. prose: know which is which

Mechanism	Enforcement	Kind
PRD gate (intent required for significant changes)	Script on commit	Machine
Architect briefing at session start	SessionStart hook	Machine
Intent verification + architectural sign-off	Stop hook	Machine
Worktree isolation for parallel agents	Pre-commit hook + watchdog cron	Machine
Missed-cron replay after sleep/reboot	Anacron-style watchdog	Machine
Session digest for context handoff	SessionEnd hook	Machine
Root-cause-first bug discipline	Constitution (prose)	Judgment
Reuse-before-rebuild (specialist inventory)	Constitution + registry script	Hybrid
Ship-to-prod workflow	Constitution (prose)	Judgment
Push-back-when-wrong agent personality	Constitution (prose)	Judgment

The split is deliberate. Mechanical invariants (“never commit from the shared checkout”) are enforced mechanically. Judgment calls (“is this a bandaid or a root-cause fix?”) stay prose, because a script can’t make them — but the prose includes worked examples and the documented incidents that created each rule, which is what makes it stick.

Root-cause-first: why bugs don’t recur

Every bug gets this treatment, in this order, before the symptom is patched:

1. Diagnose the root cause: 5-Whys until you hit a process gap, not a code line.
2. Fix the process: the script, the hook, the schema, the convention that generated this class of bug.
3. Backfill existing instances of the bug class in one automated pass.
4. Add a regression guard so the class can’t return silently.
5. Then patch the original symptom, which is often already fixed by step 3, by design.

Example: a scheduled job missed its run after a reboot. The bandaid is “re-run it once.” The root-cause fix was an anacron-style catchup watchdog that detects and replays any missed job — which then caught a real miss two days later that nobody would have noticed.
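The catch-up idea in that example can be sketched as follows, assuming each job has a fixed interval and the watchdog keeps a record of last successful runs. The five-minute grace period, the in-memory dictionaries, and the function names are hypothetical; a real watchdog would persist `last_run` to a file.

```python
from datetime import datetime, timedelta
from typing import Callable


def missed_jobs(
    schedule: dict[str, timedelta],
    last_run: dict[str, datetime],
    now: datetime,
    grace: timedelta = timedelta(minutes=5),
) -> list[str]:
    """Return jobs that never ran or whose last run is older than interval + grace."""
    missed = []
    for job, interval in schedule.items():
        previous = last_run.get(job)
        if previous is None or now - previous > interval + grace:
            missed.append(job)
    return missed


def replay_missed(
    schedule: dict[str, timedelta],
    last_run: dict[str, datetime],
    now: datetime,
    run: Callable[[str], object],
) -> list[str]:
    """Run every missed job once and record the catch-up time."""
    replayed = missed_jobs(schedule, last_run, now)
    for job in replayed:
        run(job)
        last_run[job] = now
    return replayed
```

Run shortly after boot or wake, a check like this would replay a daily job whose last recorded run was 30 hours ago and leave an hourly job that ran 30 minutes ago alone.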

This is why the system compounds. Most agent setups fight the same five fires forever; ours retires a fire class every time one burns.

The ranked mechanisms

If you can only adopt a few, adopt them in this order:

1. No unverified work reaches the human: spec-driven QA against the PRD’s acceptance tests, before ship, every time.
2. Independent intent verification: a no-stake agent checks the diff against the ask.
3. PRD gate: significant change with no written intent = blocked, deterministically.
4. File-based state: every decision, plan, and ship record lives in a file; sessions are disposable.
5. Worktree isolation + pre-commit enforcement: parallel agents physically cannot collide.
6. Root-cause-first discipline: bug classes get retired, not revisited.
7. Specialist registry with reuse mandate: tooling and regression baselines compound instead of resetting.
8. Changelog + workplan logging: progress is visible and auditable without reading transcripts.

Why it works

None of these mechanisms make the model smarter. They do something better: they remove the failure modes that don’t depend on intelligence. Lost context, drifted intent, unverified claims, parallel collisions, recurring bugs, invisible work — each one is structural, and each one is eliminated structurally.

What’s left is the model doing what it’s genuinely good at — writing code against a clear, written spec, with ground truth injected and verification waiting at the exit. That’s the whole trick. One-shot quality isn’t a prompt. It’s an exit gate the work has to pass through, and a paper trail that survives the session that produced it.