The One-Shot Methodology

We ship working applications in a single pass — idea in the morning, deployed and verified by evening — nearly every day. People assume it’s the model. It isn’t. The same model, prompted ad-hoc, produces the same coin-flip quality everyone complains about.

What changed our hit rate was process, enforced by machinery the agent cannot opt out of. This page is the audit of our own system: the pipeline every app goes through, the guardrails that are machine-enforced versus merely written down, and the ranked list of mechanisms that actually explain the results.

The core insight: instructions drift, infrastructure doesn’t

Every behavioral rule you write into a context file will eventually be ignored — not because the model is bad, but because context fills with task work and the rule loses salience. We learned this the hard way, repeatedly. A protocol that mattered to us failed for weeks despite being documented in ALL CAPS with examples, until we moved it into a Stop hook that fires automatically at the end of every turn. It has not failed since.

That pattern generalizes into our first law:

Anything that must happen every time gets a hook, a gate, a cron, or a pre-commit script. Prose rules are for judgment calls only.

Everything below follows from it.

The pipeline: how an app actually gets built

Every application moves through seven steps. Each one is implemented by a real script, agent, or hook — not by hoping the model remembers.

1. Scaffold — projects exist as files, not ideas

One command creates the project: a folder with README.md, PRD.md, WORKPLAN.md, CHANGELOG.md, and LINKS.md, plus an entry in a machine-readable project registry. Nothing starts as “a conversation we had.”

Failure prevented: lost context across sessions, re-invented requirements, no decision record.
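
A minimal sketch of what such a scaffold command might look like. The five file names come from the step above; the `registry.json` location, its JSON shape, and the `scaffold` function name are assumptions made for illustration, not the actual tool.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

# File names come from the text above; everything else is an assumption.
PROJECT_FILES = ["README.md", "PRD.md", "WORKPLAN.md", "CHANGELOG.md", "LINKS.md"]


def scaffold(root: Path, name: str) -> Path:
    """Create the project folder and record it in a JSON registry."""
    project = root / name
    project.mkdir(parents=True, exist_ok=False)  # refuse to clobber an existing project
    for filename in PROJECT_FILES:
        title = filename.removesuffix(".md")
        (project / filename).write_text(f"# {name}: {title}\n", encoding="utf-8")

    registry_path = root / "registry.json"  # hypothetical registry location
    if registry_path.exists():
        registry = json.loads(registry_path.read_text(encoding="utf-8"))
    else:
        registry = []
    registry.append({
        "name": name,
        "path": str(project),
        "created": datetime.now(timezone.utc).isoformat(),
    })
    registry_path.write_text(json.dumps(registry, indent=2), encoding="utf-8")
    return project
```

Calling `scaffold(Path("projects"), "agent-rail")` would leave five stub files on disk and one registry entry, so the next session can find the project without anyone remembering a conversation.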

2. Lock intent — written requirements before code

Significant changes must trace to written intent: a fresh PRD, user stories, or at minimum a commit message that states the story. A deterministic PRD gate (a script, not an LLM — it costs $0 to run) blocks significant changes that have no written intent anywhere.

The PRD’s most important section is its acceptance tests — concrete, user-visible behaviors derived from real usage, each one falsifiable. Not “chat should work well” but “clicking any live agent in any rail section opens its pane.”

Failure prevented: “I’ll fix this quickly” commits where nobody — including the agent — can later say what “this” was.
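
A deterministic gate of this kind can be sketched as a pure function. The thresholds for "significant", the `Story:` line convention in commit messages, and the rule that a diff touching `PRD.md` counts as written intent are all assumptions chosen for the example; the article does not specify them.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    path: str
    lines_changed: int


# Assumed thresholds; the article does not say how "significant" is defined.
MAX_TRIVIAL_LINES = 40
MAX_TRIVIAL_FILES = 3
# Assumed convention: a commit message states its story on a "Story:" line.
STORY_LINE = re.compile(r"^story:\s*\S+", re.IGNORECASE | re.MULTILINE)


def is_significant(changes: list[Change]) -> bool:
    total = sum(c.lines_changed for c in changes)
    return total > MAX_TRIVIAL_LINES or len(changes) > MAX_TRIVIAL_FILES


def has_written_intent(changes: list[Change], commit_message: str) -> bool:
    # Simplification: a PRD edit in the same diff counts as fresh written intent.
    touches_prd = any(c.path.endswith("PRD.md") for c in changes)
    return touches_prd or bool(STORY_LINE.search(commit_message))


def prd_gate(changes: list[Change], commit_message: str) -> tuple[bool, str]:
    """Return (allowed, reason). Pure function: no model call, no network."""
    if not is_significant(changes):
        return True, "trivial change, no written intent required"
    if has_written_intent(changes, commit_message):
        return True, "significant change with written intent"
    return False, "significant change with no PRD update or story in the commit message"
```

Because the function takes only the diff summary and the commit message, the same inputs always give the same verdict, which is what lets it run on every commit at no cost.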

3. Brief — inject ground truth at session start

A SessionStart hook assembles a briefing for the build session: the diff so far, the PRD and milestones, prior related changes, and a map of the system’s contracts and shared boundaries. The agent reads ground truth before interpreting the request.

Failure prevented: the agent confidently building on a stale mental model of the codebase.
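
As an illustration, the briefing could be assembled by a small function like the one below. It assumes milestones live in `WORKPLAN.md`, prior changes in `CHANGELOG.md`, and the contracts map in a file called `CONTRACTS.md`; that last name is hypothetical, and the diff is passed in as text rather than fetched from git to keep the sketch self-contained.

```python
from pathlib import Path


def read_or_placeholder(path: Path, limit: int = 4000) -> str:
    """Return up to `limit` characters of a file, or a visible marker if it is missing."""
    if not path.exists():
        return f"(missing: {path.name})"
    return path.read_text(encoding="utf-8")[:limit]


def build_briefing(project: Path, diff_text: str) -> str:
    """Assemble the session briefing as one plain-text document."""
    sections = [
        ("Diff so far", diff_text.strip() or "(no changes yet)"),
        ("PRD", read_or_placeholder(project / "PRD.md")),
        ("Milestones", read_or_placeholder(project / "WORKPLAN.md")),
        ("Prior changes", read_or_placeholder(project / "CHANGELOG.md")),
        # CONTRACTS.md is a hypothetical name for the contracts/boundaries map.
        ("Contracts and shared boundaries", read_or_placeholder(project / "CONTRACTS.md")),
    ]
    return "\n\n".join(f"## {title}\n{body}" for title, body in sections)
```

Missing files show up as explicit `(missing: ...)` markers instead of being silently skipped, so a gap in ground truth is visible to the agent rather than hidden.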

4. Build — under constraints that make parallelism safe

  • Worktree isolation is mandatory for any concurrent agent that touches git state. Each agent gets its own git worktree; a pre-commit hook on the shared checkout refuses commits from it outright, and a watchdog cron resets drift within minutes. We adopted this after parallel agents silently switched branches under each other three separate times, each costing 15+ minutes of reflog archaeology.
  • Reuse before rebuild. Before spawning a generic agent, the orchestrator must check the registry of existing specialists (QA runner, researcher, domain experts — each with a hardened CLI, shared output formats, and accumulated regression baselines). Re-implementing an existing specialist is a violation, not a style choice.
  • Model routing by tier. Heavyweight models for architectural judgment, mid-tier for execution, cheap models for scanning. The expensive review only fires when a deterministic significance classifier says the change warrants it.
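
The routing rule in the last bullet can be sketched as a lookup plus a deterministic check. The task-kind labels, the path prefixes treated as shared boundaries, and the 200-line threshold are assumptions for illustration only.

```python
from enum import Enum


class Tier(Enum):
    HEAVY = "heavyweight"
    MID = "mid-tier"
    CHEAP = "cheap"


# Task kinds mirror the three roles named above; the string labels are assumptions.
ROUTES = {
    "architecture": Tier.HEAVY,
    "execution": Tier.MID,
    "scan": Tier.CHEAP,
}

# Hypothetical path prefixes treated as shared boundaries.
SENSITIVE_PREFIXES = ("schema/", "api/", "migrations/")


def route(task_kind: str) -> Tier:
    # Unknown kinds fall back to the mid tier rather than failing.
    return ROUTES.get(task_kind, Tier.MID)


def needs_expensive_review(paths: list[str], lines_changed: int, threshold: int = 200) -> bool:
    """Deterministic significance check: large diffs or touched boundaries."""
    touches_boundary = any(p.startswith(SENSITIVE_PREFIXES) for p in paths)
    return touches_boundary or lines_changed >= threshold
```

Keeping the significance check free of model calls means the decision to spend on a heavyweight review is itself free and repeatable.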

5. Verify — intent verification, then spec-driven QA

Two independent checks, neither performed by the builder:

Intent verification. A read-only agent with no stake in the outcome re-derives what was asked (from the PRD and commit messages), reads the diff, and answers one question: does the delivery match the ask? A mismatch blocks. This catches the most insidious failure in agentic development — work that is technically fine but isn’t what anyone asked for.

Spec-driven QA. A QA agent parses the PRD into test specs and runs them with Playwright against the deployed app — desktop and mobile viewports, accessibility, visual regression against baselines, screenshots as evidence. The key design choice: QA tests what the PRD promised, never what the builder claimed. “I fixed it” is not evidence; a green run against the acceptance tests is.
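
The first half of that QA step, turning the PRD into test specs, might look like the sketch below. It assumes the PRD is Markdown with a heading named "Acceptance tests" followed by bullet or numbered lines; the `Spec` type and the `AT-n` identifiers are invented for the example, and the browser run itself is left out.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Spec:
    spec_id: str
    behavior: str


# Assumed PRD layout: a Markdown heading "Acceptance tests" followed by list items.
HEADING = re.compile(r"^#+\s*acceptance tests\s*$", re.IGNORECASE)
BULLET = re.compile(r"^\s*(?:[-*]|\d+\.)\s+(.*\S)\s*$")


def parse_acceptance_tests(prd_text: str) -> list[Spec]:
    specs: list[Spec] = []
    in_section = False
    for line in prd_text.splitlines():
        if line.startswith("#"):
            # Any heading either opens the acceptance section or closes it.
            in_section = bool(HEADING.match(line))
            continue
        match = BULLET.match(line) if in_section else None
        if match:
            specs.append(Spec(f"AT-{len(specs) + 1}", match.group(1)))
    return specs
```

Each `Spec` would then drive one browser test, so the test list is derived from what the PRD promised and never from the builder's summary of the work.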

Both gates fail open on infrastructure errors (a crashed verifier never wedges a build), and the review gate caps consecutive blocks so that a builder/reviewer disagreement escalates to the human instead of looping forever.
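
The fail-open and escalation behavior can be sketched as one wrapper around any check. The cap of three consecutive blocks is an assumed value, and `run_gate` is a hypothetical name; the sketch only shows the control flow described above.

```python
from typing import Callable

MAX_CONSECUTIVE_BLOCKS = 3  # assumed cap; the article does not give the number


def run_gate(check: Callable[[], bool], consecutive_blocks: int) -> tuple[str, int]:
    """Return (decision, updated_block_count).

    decision is "pass", "block", or "escalate".
    """
    try:
        matches = check()
    except Exception:
        # Infrastructure error: fail open so a crashed verifier never wedges the build.
        return "pass", consecutive_blocks
    if matches:
        return "pass", 0  # a genuine pass resets the streak
    consecutive_blocks += 1
    if consecutive_blocks >= MAX_CONSECUTIVE_BLOCKS:
        # Stop looping and hand the disagreement to the human.
        return "escalate", consecutive_blocks
    return "block", consecutive_blocks
```

The caller stores the returned count between turns, for example in a small state file, so the cap survives across invocations of the hook.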

6. Ship to prod — the human reviews products, not diffs

The orchestrating agent opens the PR, waits for CI, merges, deploys, smoke-tests the production URL itself, restarts anything that consumes the new code, and only then reports: “live at X, here’s what changed, go click it.” The human evaluates by using the product. Diffs are reviewed by a code-review agent, not by the human.

Failure prevented: the dead zone where finished work sits in PR limbo, and the human’s attention is spent on diff-reading instead of product judgment.
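
One way to picture the ship sequence is as an ordered list of steps that halts at the first failure, so "live" is only reported after the smoke test and restarts have succeeded. The `ship` function and the step callables are hypothetical stand-ins for the real PR, CI, deploy, and smoke-test tooling.

```python
from typing import Callable

# A step is a label plus a callable that returns True on success.
Step = tuple[str, Callable[[], bool]]


def ship(steps: list[Step]) -> tuple[bool, list[str]]:
    """Run steps in order and stop at the first failure.

    Returns (shipped, log). The human is only told "live" when shipped is True.
    """
    log: list[str] = []
    for name, action in steps:
        ok = action()
        log.append(f"{name}: {'ok' if ok else 'FAILED'}")
        if not ok:
            return False, log
    return True, log
```

In use, the list would hold the steps named above in order: open the PR, wait for CI, merge, deploy, smoke-test the production URL, restart consumers.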

One command appends a timestamped entry to the project’s CHANGELOG.md and flips the matching WORKPLAN.md checkbox. The changelog — not chat history — is the source of truth for “what did we build.” Chat history dies at session rotation; files don’t.
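
A minimal sketch of that logging command, assuming workplan items are Markdown task-list lines of the form `- [ ] item` and that the changelog is a flat list of dated bullets. The entry format and the `log_ship` name are assumptions.

```python
import re
from datetime import datetime, timezone
from pathlib import Path


def log_ship(project: Path, item: str, summary: str) -> bool:
    """Append a changelog entry and tick the matching workplan checkbox.

    Returns True if a checkbox was flipped.
    """
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    with (project / "CHANGELOG.md").open("a", encoding="utf-8") as fh:
        fh.write(f"- {stamp} | {item}: {summary}\n")

    workplan = project / "WORKPLAN.md"
    text = workplan.read_text(encoding="utf-8")
    # Assumes workplan items are Markdown task-list lines: "- [ ] item".
    pattern = re.compile(
        rf"^([ \t]*- )\[ \]( {re.escape(item)})[ \t]*$", re.MULTILINE
    )
    updated, count = pattern.subn(r"\1[x]\2", text, count=1)
    workplan.write_text(updated, encoding="utf-8")
    return count == 1
```

The changelog entry is written even when no checkbox matches, and the return value lets the caller flag a ship that was never planned in the workplan.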

Machine-enforced vs. prose: know which is which

| Mechanism | Enforcement | Kind |
|---|---|---|
| PRD gate (intent required for significant changes) | Script on commit | Machine |
| Architect briefing at session start | SessionStart hook | Machine |
| Intent verification + architectural sign-off | Stop hook | Machine |
| Worktree isolation for parallel agents | Pre-commit hook + watchdog cron | Machine |
| Missed-cron replay after sleep/reboot | Anacron-style watchdog | Machine |
| Session digest for context handoff | SessionEnd hook | Machine |
| Root-cause-first bug discipline | Constitution (prose) | Judgment |
| Reuse-before-rebuild (specialist inventory) | Constitution + registry script | Hybrid |
| Ship-to-prod workflow | Constitution (prose) | Judgment |
| Push-back-when-wrong agent personality | Constitution (prose) | Judgment |

The split is deliberate. Mechanical invariants (“never commit from the shared checkout”) are enforced mechanically. Judgment calls (“is this a bandaid or a root-cause fix?”) stay prose, because a script can’t make them — but the prose includes worked examples and the documented incidents that created each rule, which is what makes it stick.
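
As one example of a mechanical invariant, the "never commit from the shared checkout" rule could be checked as sketched below. It relies on a documented git behavior: in a linked worktree, `.git` is a file holding a `gitdir:` pointer, while in the main checkout it is a directory. The function names are hypothetical, and the sketch ignores submodules, which also use a `.git` file.

```python
from pathlib import Path


def is_linked_worktree(checkout: Path) -> bool:
    """True if `checkout` is a linked git worktree rather than the main checkout.

    In a linked worktree, `.git` is a file containing a "gitdir: ..." pointer;
    in the main checkout it is a directory.
    """
    dot_git = checkout / ".git"
    return dot_git.is_file() and dot_git.read_text(encoding="utf-8").startswith("gitdir:")


def precommit_exit_code(checkout: Path) -> int:
    """Exit code for a pre-commit hook: 0 allows the commit, 1 refuses it."""
    return 0 if is_linked_worktree(checkout) else 1
```

A pre-commit hook script would call `precommit_exit_code` on the repository root and exit with the result; git aborts the commit on any non-zero exit.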

Every bug gets this treatment, in this order, before the symptom is patched:

  1. Diagnose the root cause — 5-Whys until you hit a process gap, not a code line.
  2. Fix the process — the script, the hook, the schema, the convention that generated this class of bug.
  3. Backfill existing instances of the bug class in one automated pass.
  4. Add a regression guard so the class can’t return silently.
  5. Then patch the original symptom — which is often already fixed by step 3, by design.

Example: a scheduled job missed its run after a reboot. The bandaid would have been “re-run it once.” The root-cause fix was an anacron-style catch-up watchdog that detects and replays any missed job. Two days later it caught a real miss that nobody would otherwise have noticed.
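
The catch-up idea can be sketched in a few lines: compare each job's last successful run to its interval, and replay anything overdue. The `Job` record, the way last-run times are stored, and the `run` callback are assumptions; this is a sketch of the anacron-style pattern, not the watchdog itself.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Optional


@dataclass(frozen=True)
class Job:
    name: str
    interval: timedelta
    last_run: Optional[datetime]  # None means the job has never run


def missed_jobs(jobs: list[Job], now: datetime) -> list[str]:
    """Names of jobs whose last successful run is older than one interval."""
    return [
        job.name
        for job in jobs
        if job.last_run is None or now - job.last_run > job.interval
    ]


def replay_missed(jobs: list[Job], now: datetime, run: Callable[[str], None]) -> list[str]:
    """Run every overdue job once and return the names that were replayed."""
    replayed = missed_jobs(jobs, now)
    for name in replayed:
        run(name)
    return replayed
```

Run on a short timer and at boot, a check like this makes a missed schedule self-healing: the machine being asleep at the scheduled minute no longer matters.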

This is why the system compounds. Most agent setups fight the same five fires forever; ours retires a fire class every time one burns.

If you can only adopt a few, adopt them in this order:

  1. No unverified work reaches the human — spec-driven QA against the PRD’s acceptance tests, before ship, every time.
  2. Independent intent verification — a no-stake agent checks the diff against the ask.
  3. PRD gate — significant change with no written intent = blocked, deterministically.
  4. File-based state — every decision, plan, and ship record lives in a file; sessions are disposable.
  5. Worktree isolation + pre-commit enforcement — parallel agents physically cannot collide.
  6. Root-cause-first discipline — bug classes get retired, not revisited.
  7. Specialist registry with reuse mandate — tooling and regression baselines compound instead of resetting.
  8. Changelog + workplan logging — progress is visible and auditable without reading transcripts.

None of these mechanisms make the model smarter. They do something better: they remove the failure modes that don’t depend on intelligence. Lost context, drifted intent, unverified claims, parallel collisions, recurring bugs, invisible work — each one is structural, and each one is eliminated structurally.

What’s left is the model doing what it’s genuinely good at — writing code against a clear, written spec, with ground truth injected and verification waiting at the exit. That’s the whole trick. One-shot quality isn’t a prompt. It’s an exit gate the work has to pass through, and a paper trail that survives the session that produced it.