Cost & Model Routing Fundamentals


Agent systems can get expensive fast. A system that runs 155+ cron jobs, spawns parallel build agents, and does deep research has real LLM costs. Getting the cost picture right requires understanding the billing models and routing work to the cheapest model that can actually do it.

This page covers the three-tier routing strategy, the difference between Max quota and API dollars, and what a real cost picture looks like at scale.

The two billing models

The Claude Max plan is a flat subscription (currently $100–200/mo depending on tier). It gives you a high usage quota, denominated in “usage” rather than dollars. When you run Claude Code with a Max subscription, sessions draw from that quota rather than charging per token.

The ceiling matters: the Max plan has rate limits. If you hit them — spinning up many parallel agents, running long research tasks simultaneously — you’ll see 429 errors (“Too Many Requests”). At that point, additional work has to wait or route elsewhere.
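The "wait" option can be sketched as a retry loop with exponential backoff. This is a minimal sketch, not the system's actual code: `RateLimitError`, `call_with_backoff`, and the `send` callable are hypothetical names standing in for whatever client you use and however it surfaces an HTTP 429.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical exception: raise this when your client sees HTTP 429."""


def call_with_backoff(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send(); on a rate-limit error, wait and retry with growing delays.

    `send` is any zero-argument callable that performs the request.
    `sleep` is injectable so the loop can be exercised without real waiting.
    """
    for attempt in range(max_attempts):
        try:
            return send()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of retries: let the caller route the work elsewhere
            # Delay doubles each attempt (1s, 2s, 4s, ...) plus a little jitter
            # so parallel agents do not all retry at the same instant.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            sleep(delay)
```

Re-raising after the last attempt is deliberate: it hands the decision back to the caller, which can then queue the job or send it to a different billing path.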

API keys (via OpenRouter or Anthropic directly) charge per token. Input tokens and output tokens have separate rates that vary by model — Opus is significantly more expensive per token than Sonnet or Haiku. There’s no monthly cap, but there’s a bill.
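Per-token billing is simple arithmetic once you have the two rates for a model. The sketch below shows the calculation. The model names and rates are placeholders invented for illustration, not published prices, so substitute the current figures from your provider.

```python
# Placeholder rates in dollars per million tokens. These are NOT real prices;
# replace them with the current rates from your provider's pricing page.
RATES_PER_MTOK = {
    "small-model": {"input": 1.00, "output": 4.00},
    "large-model": {"input": 10.00, "output": 40.00},
}


def request_cost(model, input_tokens, output_tokens, rates=RATES_PER_MTOK):
    """Dollar cost of one request: input and output are billed at separate rates."""
    rate = rates[model]
    return (input_tokens * rate["input"] + output_tokens * rate["output"]) / 1_000_000


# The same request priced on two hypothetical tiers.
small = request_cost("small-model", input_tokens=20_000, output_tokens=2_000)
large = request_cost("large-model", input_tokens=20_000, output_tokens=2_000)
print(f"small: ${small:.4f}  large: ${large:.4f}")
```

A single request looks cheap either way. The gap only becomes visible when you multiply by the number of runs per day.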

The practical implication: a Max plan is economical for interactive, session-heavy work where you’re the active user. API keys at the right model tier are economical for automated background agents where you can pick the cheapest capable model. A mature system typically uses both.

The three routing tiers

The cost-routing strategy that came out of OpenClaw and carried into the current system:

Tier 1 — Local (free): Ollama runs open-weight models on your own hardware. For simple tasks — quick lookups, format conversions, short summaries, classification — a local 8B or 32B model is fast, free, and good enough. The system uses this for lightweight background work that doesn’t need Sonnet’s reasoning.

Tier 2 — Cheap cloud: Sonnet-class models via OpenRouter or Anthropic API. The majority of background agent work goes here. Sonnet is capable, costs a fraction of Opus per token, and handles most coding and analysis tasks well.

Tier 3 — Opus/heavy: Reserved for orchestration decisions, complex multi-file architectures, adversarial synthesis, and tasks where Sonnet demonstrably fails. Opus is not 10× better at most tasks, but it is substantially more expensive per token; the exact multiple depends on current pricing. Use it only where the reasoning advantage is real and measurable.

The routing logic in practice:

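# Explicit ordering: comparing the raw strings would sort them alphabetically.
COMPLEXITY_RANK = {"low": 0, "medium": 1, "high": 2}

def route_model(task_type, complexity):
    """Return the cheapest model tier expected to handle the task."""
    if task_type in ("classification", "format", "lookup"):
        return "ollama/llama3.2"              # Tier 1: free, local
    if COMPLEXITY_RANK[complexity] < COMPLEXITY_RANK["high"]:
        return "anthropic/claude-sonnet-4-5"  # Tier 2: cheap cloud
    return "anthropic/claude-opus-4-5"        # Tier 3: heavy, deliberate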
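Static routing covers the common case. For "tasks where Sonnet demonstrably fails", one option is to try the cheaper tier first and escalate only when a check on the result fails. The sketch below is an assumption about how that could look, not the system's own implementation; `run_with_escalation`, `run`, and `passes` are hypothetical names, and the demo uses a stub instead of a real model call.

```python
def run_with_escalation(task, run, passes, tiers):
    """Try each model tier in order, cheapest first, until a result passes.

    run(model, task) -> result   hypothetical: executes the task on a model
    passes(result) -> bool       hypothetical: e.g. tests pass, output parses
    tiers                        model names ordered from cheapest to priciest
    Returns (model_used, result). If no tier passes, the last tier's result
    is returned so the caller can inspect the failure.
    """
    model, result = None, None
    for model in tiers:
        result = run(model, task)
        if passes(result):
            break  # stop at the cheapest tier that succeeds
    return model, result


# Stub demo: the cheaper tier "fails" on this task, so the call escalates.
def fake_run(model, task):
    return "ok" if model == "anthropic/claude-opus-4-5" else "error"


used, out = run_with_escalation(
    "refactor module",
    run=fake_run,
    passes=lambda r: r == "ok",
    tiers=["anthropic/claude-sonnet-4-5", "anthropic/claude-opus-4-5"],
)
print(used, out)
```

The trade-off is that a failed cheap attempt is paid for before the expensive one runs, so this only saves money when the cheaper tier succeeds most of the time.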

What the cost picture actually looks like

Running a real agent system at scale isn’t cheap, and understating this is a form of dishonesty.

The system described in Tier 3 tracked token spend after wiring real cost attribution in May 2026. The honest numbers: approximately $186/day, and roughly $2,000/week during peak build periods. The vast majority of that was Opus sessions doing orchestration and research.

After cost-routing was applied (routine background work pushed to Sonnet, simple tasks to Ollama, Opus reserved for genuinely complex decisions), the picture improved significantly. The OpenClaw spec cited a ~75–80% LLM cost reduction from local-first routing, claiming ~$20–35/month for the research-loop use case. That figure is plausible for narrow workloads, but broader agent systems doing deep research, parallel builds, and orchestration cost more.

The honest framing: cost-routing can achieve 50–80% reduction compared to using Opus for everything, but the absolute number depends on how much work the system does. Budget for the actual volume.
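To see where a figure in that range could come from, here is the back-of-the-envelope arithmetic. The workload shares and relative prices are invented for illustration and are not measurements from the system described here. The sketch also collapses input and output rates into one relative price per tier, which is a simplification.

```python
def blended_reduction(shares, relative_price):
    """Fractional saving versus sending every token to the most expensive tier.

    shares          fraction of total tokens routed to each tier (sums to 1)
    relative_price  each tier's per-token price as a fraction of the top tier's
    """
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1")
    blended = sum(shares[tier] * relative_price[tier] for tier in shares)
    return 1.0 - blended


# Illustrative numbers only.
shares = {"local": 0.30, "sonnet": 0.55, "opus": 0.15}
relative_price = {"local": 0.0, "sonnet": 0.2, "opus": 1.0}

saving = blended_reduction(shares, relative_price)
print(f"{saving:.0%} cheaper than all-Opus")
```

With these made-up inputs the saving is 74%. The percentage says nothing about the absolute bill: the same 74% applies whether the all-Opus baseline is $50 or $5,000 a month.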

Tracking spend

The system added a token_spend table to Supabase on 2026-05-26, after the cockpit had been showing $0.00 for actual spend, a lie that masked real costs. The fix: every agent run logs actual token counts and cost (input tokens × input rate + output tokens × output rate for the model used), and the cockpit’s cost HUD shows the real 7-day rolling total.
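A minimal version of that logging can be sketched as follows. In-memory SQLite stands in for the real database so the example runs anywhere. Only the table name `token_spend` comes from the text above; the column names, `log_run`, and `rolling_total` are assumptions made for this sketch.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# In-memory SQLite stands in for the real database. The column names below
# are assumptions for this sketch, not the system's actual schema.
db = sqlite3.connect(":memory:")
db.execute(
    """CREATE TABLE token_spend (
           ts TEXT, agent TEXT, model TEXT,
           input_tokens INTEGER, output_tokens INTEGER, cost_usd REAL)"""
)


def log_run(agent, model, input_tokens, output_tokens, in_rate, out_rate, ts=None):
    """Record one agent run. Rates are dollars per million tokens."""
    cost = (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000
    ts = ts or datetime.now(timezone.utc)
    db.execute(
        "INSERT INTO token_spend VALUES (?, ?, ?, ?, ?, ?)",
        (ts.isoformat(), agent, model, input_tokens, output_tokens, cost),
    )
    return cost


def rolling_total(days=7, now=None):
    """Sum of cost_usd over the last `days` days.

    Timestamps are stored as UTC ISO-8601 strings, which sort chronologically,
    so a plain string comparison against the cutoff is enough here.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = (now - timedelta(days=days)).isoformat()
    row = db.execute(
        "SELECT COALESCE(SUM(cost_usd), 0) FROM token_spend WHERE ts >= ?",
        (cutoff,),
    ).fetchone()
    return row[0]
```

Calling `log_run(...)` at the end of every agent run and showing `rolling_total()` in a dashboard is enough to avoid a $0.00 display. Storing the computed cost alongside the token counts keeps historical totals stable when rates change later.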

If you’re building an agent system that runs background jobs, build cost tracking in early. A system that doesn’t know what it costs will eventually surprise you.

The Max quota vs API dollars decision

For someone just starting out with Claude Code:

If you’re doing primarily interactive work (you, at a terminal, with Claude Code), start with a Max subscription. It’s simpler billing and the quota handles most interactive use.
If you’re building automated background agents that run without you, add API key billing for those jobs. Pick the cheapest capable model tier and track the spend.
If you hit Max rate limits during burst periods (spawning many parallel agents), you need API key overflow — Max alone won’t scale past its rate ceiling.

The production setup for the system in Tier 3: Max plan for interactive sessions + OpenRouter API key for background agents + Anthropic direct API as fallback when OpenRouter credits run out. The fallback matters: OpenRouter credits run out unexpectedly, and a system that has no fallback silently stops working at the worst time.
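The fallback chain can be sketched as an ordered list of providers tried in turn. This is an illustration under stated assumptions, not the production code: `ProviderUnavailable`, `call_with_fallback`, and the provider callables are hypothetical, and a real version would wrap each provider's own client and error types.

```python
class ProviderUnavailable(Exception):
    """Hypothetical: raised by a provider call on exhausted credits or rate limits."""


def call_with_fallback(prompt, providers, on_fallback=print):
    """Try providers in order; return (name, response) from the first that works.

    providers    list of (name, call) pairs; call(prompt) returns a response
                 or raises ProviderUnavailable
    on_fallback  called with a message on each failure, so a fallback is
                 never silent
    """
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except ProviderUnavailable as exc:
            errors.append(f"{name}: {exc}")
            on_fallback(f"provider {name} unavailable ({exc}); trying next")
    # Every provider failed: fail loudly rather than stop working quietly.
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

In the setup described above, the list would hold an OpenRouter call first and an Anthropic direct call second. Routing `on_fallback` to your logging or alerting channel means you learn that credits ran out when the first fallback happens, not when the bill or the outage arrives.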

You’ve finished Tier 1. You now have the fundamentals: sessions and context, files-as-state, CLAUDE.md, tools, prompting, MCP, skills, hooks, settings, git discipline, logging, and cost routing. Tier 2 is where these primitives combine into systems — spawning agents, scheduling them, orchestrating them, and shipping reliably.

Next: Head to Tier 2 — Building Patterns when you’re ready, starting with Anatomy of an Agent.