Model Routing Strategy: Running Agents Under Budget


When you run dozens of AI agents 24/7, cost is the constraint that shapes everything. Claude Opus is brilliant, but at $15 per million tokens, running everything through Opus would cost $50+ per day. We needed a strategy.

This is how we built a model routing system that keeps our entire operation under $15/day.

The Problem: One Model Doesn’t Fit All

Not every task needs the smartest model. A cron job checking if a Telegram bot is alive doesn’t need Opus-level reasoning. A daily grade sync from Canvas LMS doesn’t need creative writing ability. But drafting a consulting proposal or writing a strategic document? That’s where you want the best.

The insight: classify tasks by complexity, then route to the cheapest model that can handle them.

The Three Tiers

We settled on three tiers after benchmarking several models:

Tier	Models	Best For	Cost (per 1M tokens)
Worker	Claude Haiku 3.5, Ollama qwen2.5:7b	File operations, formatting, simple lookups, status checks	$0.25-1.00 (Ollama: free)
Specialist	Claude Sonnet, Kimi K2.5	Coding, analysis, research, most daily tasks	$0.38-3.00
Executive	Claude Opus	Creative writing, complex reasoning, strategy, nuanced decisions	$15.00

The critical discovery was Kimi K2.5. When we researched pricing on 2026-04-06, it was approximately 4–5x cheaper than Sonnet (verify current pricing against the provider’s rate card). Quality is competitive for structured tasks, and the publicly reported SWE-Bench score is strong for coding work.

The tradeoff: Kimi is slower and weaker on hallucination-prone open-ended tasks. So we use it for analytics, QA pipelines, and research observation — tasks where speed doesn’t matter and outputs are verifiable. Sonnet handles creative writing, tutoring, and resume work where nuance matters.

The Routing Logic

The model router (scripts/model-router.py) follows a simple pipeline:

1. Analyze task text for keywords and complexity signals
2. Classify into tier (worker / specialist / executive)
3. Check daily budget against $15 target
4. If >80% budget consumed → auto-downgrade one tier
5. Route to cheapest available model in tier

The auto-downgrade at 80% budget is the safety valve. If we’ve burned more than $12 (80% of the $15 target) by 3 PM, even specialist tasks get routed to Haiku or Ollama. This prevents budget blowouts from unexpected heavy usage.
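
The pipeline above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the contents of scripts/model-router.py: the keyword lists, the model identifiers, and the cheapest-first ordering inside each tier are hypothetical. Only the three tiers, the $15 target, and the 80% downgrade threshold come from the text. Model availability checks are left out for brevity.

```python
DAILY_BUDGET_USD = 15.00      # daily target from the article
DOWNGRADE_THRESHOLD = 0.80    # auto-downgrade once more than 80% is spent

# Tiers ordered from cheapest to most capable.
TIERS = ["worker", "specialist", "executive"]

# Hypothetical keyword signals; a real router would use richer heuristics.
TIER_KEYWORDS = {
    "executive": ("proposal", "strategy", "strategic", "creative"),
    "specialist": ("code", "analyze", "analysis", "research", "debug"),
}

# Hypothetical model identifiers, listed cheapest first within each tier.
TIER_MODELS = {
    "worker": ["ollama/qwen2.5:7b", "claude-haiku"],
    "specialist": ["kimi-k2.5", "claude-sonnet"],
    "executive": ["claude-opus"],
}


def classify(task: str) -> str:
    """Return the highest tier whose keywords appear in the task text."""
    text = task.lower()
    for tier in ("executive", "specialist"):
        if any(word in text for word in TIER_KEYWORDS[tier]):
            return tier
    return "worker"  # default to the cheapest tier


def downgrade(tier: str) -> str:
    """Step down one tier; the worker tier has nowhere lower to go."""
    return TIERS[max(TIERS.index(tier) - 1, 0)]


def route(task: str, spent_today_usd: float) -> tuple[str, str]:
    """Pick (tier, model) for a task given today's spend so far."""
    tier = classify(task)
    if spent_today_usd / DAILY_BUDGET_USD > DOWNGRADE_THRESHOLD:
        tier = downgrade(tier)
    return tier, TIER_MODELS[tier][0]
```

With these assumed keywords, route("Draft a consulting proposal", 3.0) selects the executive tier, while the same task at $12.50 spent drops to the specialist tier.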

The Unified Router: Three Systems, One Brain

The model router handles individual LLM calls, but we also needed routing across three entire systems:

Agent System tools — MCP servers, cron jobs, file operations
OpenClaw agents — Specialized agents with their own orchestration
Claude Code sessions — Direct coding and complex tasks

The unified router (agents/ai_os/unified_router.py) uses weighted keyword scoring with a multi-word bonus. It’s health-aware: if OpenClaw’s gateway is down, it automatically demotes OpenClaw (OC) agents and routes to Agent System equivalents. A 30-second health cache prevents hammering the gateway.
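
To make the two ideas concrete, here is a rough sketch of weighted keyword scoring with a multi-word bonus, combined with a time-limited health cache. It is an illustration only, not the code in agents/ai_os/unified_router.py: the keywords, weights, bonus value, system labels, and the shape of the health check are all assumptions. Only the 30-second cache lifetime and the demotion behavior come from the text.

```python
import time
from typing import Callable

# Hypothetical weighted keywords per system.
KEYWORDS = {
    "agent_system": {"cron": 2.0, "file": 1.0, "mcp server": 2.0},
    "openclaw": {"research": 2.0, "slide deck": 2.0, "qa": 1.5},
    "claude_code": {"refactor": 2.0, "debug": 1.5, "pull request": 2.0},
}
MULTI_WORD_BONUS = 1.0    # assumed extra weight for multi-word phrase matches
HEALTH_TTL_SECONDS = 30   # cache lifetime from the article


class HealthCache:
    """Remember the last health-check result for a fixed time window."""

    def __init__(self, check: Callable[[], bool],
                 ttl: float = HEALTH_TTL_SECONDS,
                 clock: Callable[[], float] = time.monotonic):
        self._check, self._ttl, self._clock = check, ttl, clock
        self._value = False
        self._expires = float("-inf")  # forces a real check on first use

    def healthy(self) -> bool:
        now = self._clock()
        if now >= self._expires:       # first call or expired entry
            self._value = self._check()
            self._expires = now + self._ttl
        return self._value


def score(task: str) -> dict[str, float]:
    """Sum keyword weights per system, adding a bonus for multi-word hits."""
    text = task.lower()
    scores = {}
    for system, weights in KEYWORDS.items():
        total = 0.0
        for phrase, weight in weights.items():
            if phrase in text:
                total += weight + (MULTI_WORD_BONUS if " " in phrase else 0.0)
        scores[system] = total
    return scores


def pick_system(task: str, openclaw_health: HealthCache) -> str:
    """Choose the top-scoring system, demoting OpenClaw when it is down."""
    scores = score(task)
    if not openclaw_health.healthy():
        # Demote: credit OpenClaw's score to the Agent System equivalent.
        scores["agent_system"] += scores.pop("openclaw")
    return max(scores, key=scores.get)
```

The clock is injectable so the cache can be tested without sleeping. Ties resolve to whichever system appears first in KEYWORDS, which is an arbitrary choice in this sketch.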

Seven OpenClaw agents were retired and mapped to Agent System (AS) equivalents:

research-master and researcher → AS research agent
slide-architect → AS Deck Architect
qa-engine and qa-pro → AS QA agent
quanta and coding → AS specialists

We built a 50-case test suite to validate routing accuracy; the suite passes and routing decisions are sub-second in practice.

Budget Tracking

Routing is pointless without visibility into spend. Our budget system:

State file: state/budget.json — running daily total
Per-request logging: every routed call logs estimated_cost_usd to routing.log
Nightly summary: track-costs.sh runs at 11 PM and sends a Telegram message with the day’s breakdown
Weekly reports: cost_tracker.py generates weekly cost reports with per-agent breakdowns
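
As a rough sketch of the first two pieces, the function below updates a daily running total and appends one JSON line per routed call. The estimated_cost_usd field and the two file names come from the list above. The other field names (date, total_usd, ts, model) and the reset-at-midnight behavior are assumptions made for illustration, not the actual schema.

```python
import json
from datetime import date, datetime, timezone
from pathlib import Path


def record_call(model: str, estimated_cost_usd: float,
                state_path: Path = Path("state/budget.json"),
                log_path: Path = Path("routing.log")) -> float:
    """Log one routed call and return the updated daily total in USD."""
    today = date.today().isoformat()

    # Load the running total, starting fresh on a new day or a missing file.
    state = {"date": today, "total_usd": 0.0}
    if state_path.exists():
        saved = json.loads(state_path.read_text())
        if saved.get("date") == today:
            state = saved

    state["total_usd"] = round(state["total_usd"] + estimated_cost_usd, 6)
    state_path.parent.mkdir(parents=True, exist_ok=True)
    state_path.write_text(json.dumps(state))

    # Append one JSON object per line (JSONL) for later analysis.
    entry = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "model": model,
        "estimated_cost_usd": estimated_cost_usd,
    }
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a") as fh:
        fh.write(json.dumps(entry) + "\n")

    return state["total_usd"]
```

The returned total is what a router would compare against the daily target before deciding whether to downgrade.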

The Telegram summary looks like:

Daily Cost Report
─────────────────
Opus:     $3.20 (2 calls)
Sonnet:   $4.80 (47 calls)
Kimi:     $1.10 (23 calls)
Haiku:    $0.40 (89 calls)
Ollama:   $0.00 (34 calls)
─────────────────
Total:    $9.50 / $15.00
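
A report in this shape can be produced by aggregating the per-request log. The sketch below assumes each log line is a JSON object with model and estimated_cost_usd fields (the model field name is an assumption), and it sorts models by cost, which is a choice made for this example rather than the ordering track-costs.sh uses. Sending the text to Telegram is left out.

```python
import json
from collections import defaultdict


def daily_report(jsonl_lines, budget_usd: float = 15.00) -> str:
    """Build a per-model cost summary from JSONL routing-log lines."""
    cost = defaultdict(float)
    calls = defaultdict(int)
    for line in jsonl_lines:
        if not line.strip():
            continue  # tolerate blank lines in the log
        entry = json.loads(line)
        cost[entry["model"]] += entry["estimated_cost_usd"]
        calls[entry["model"]] += 1

    rule = "─" * 17
    rows = ["Daily Cost Report", rule]
    # Most expensive model first (an assumption for this sketch).
    for model in sorted(cost, key=cost.get, reverse=True):
        noun = "call" if calls[model] == 1 else "calls"
        rows.append(f"{model + ':':<10}${cost[model]:.2f} ({calls[model]} {noun})")
    rows += [rule, f"{'Total:':<10}${sum(cost.values()):.2f} / ${budget_usd:.2f}"]
    return "\n".join(rows)
```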

Ollama: The Free Tier

Running Ollama locally on the Mac Mini gives us a genuinely free tier. We use:

qwen2.5:7b — general-purpose worker tasks
nomic-embed-text — embeddings for semantic memory (Mem0) and RAG (Canvas LMS content)

These models run on Apple Silicon with zero API cost. The tradeoff is they’re slower and less capable than cloud models, but for embedding generation and simple formatting tasks, they’re more than sufficient.
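
For reference, requesting an embedding from a local Ollama server needs only the standard library. This sketch assumes Ollama is listening on its default address (localhost:11434) and uses its /api/embeddings endpoint, which takes a model and a prompt and returns an embedding array. Newer Ollama releases also provide /api/embed, so check the API documentation for the installed version. The function names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default local address


def build_embedding_request(text: str,
                            model: str = "nomic-embed-text") -> urllib.request.Request:
    """Build a POST request for Ollama's /api/embeddings endpoint."""
    payload = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(text: str) -> list[float]:
    """Return the embedding vector; requires a running local Ollama server."""
    request = build_embedding_request(text)
    with urllib.request.urlopen(request, timeout=30) as resp:
        return json.loads(resp.read())["embedding"]
```

Splitting request construction from the network call keeps the payload easy to inspect without a server running.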

Lessons Learned

1. Route by task, not by agent. Early on, we tried assigning models to agents (e.g., “the CRM agent always uses Haiku”). This was too rigid. The same agent might need Haiku for a simple lookup and Sonnet for a complex analysis. Route each individual task.

2. Auto-downgrade beats hard limits. We experimented with hard budget caps that stopped all operations. This was worse than degraded quality. Auto-downgrading keeps everything running, just at lower quality.

3. Benchmark your actual tasks. Generic benchmarks are misleading. Kimi K2.5 looks great on coding benchmarks but hallucinates on open-ended creative tasks. Test with YOUR workload.

4. Log everything. The JSONL routing log lets us retroactively analyze which models handled which tasks and whether quality suffered during downgrades. This data drives continuous optimization.

5. Free tiers compound. Ollama handles 15-20% of our daily requests at zero cost. Over a month, that’s $30-50 saved — enough to fund the occasional Opus splurge.

For the full technical details, see the Model Router Architecture reference.