From Zero to 28 Agents: Our Model Routing Strategy

Narrative 10 min read

When you run 28 AI agents 24/7, cost is the constraint that shapes everything. Claude Opus is brilliant, but at $15 per million tokens, running everything through Opus would cost $50+ per day. We needed a strategy.

This is how we built a model routing system that keeps our entire operation under $15/day.


Not every task needs the smartest model. A cron job checking if a Telegram bot is alive doesn’t need Opus-level reasoning. A daily grade sync from Canvas LMS doesn’t need creative writing ability. But drafting a consulting proposal or writing a strategic document? That’s where you want the best.

The insight: classify tasks by complexity, then route to the cheapest model that can handle them.

We settled on three tiers after benchmarking several models:

| Tier | Models | Best For | Cost (per 1M tokens) |
|---|---|---|---|
| Worker | Claude Haiku 3.5, Ollama qwen2.5:7b | File operations, formatting, simple lookups, status checks | $0.25-1.00 (Ollama: free) |
| Specialist | Claude Sonnet, Kimi K2.5 | Coding, analysis, research, most daily tasks | $0.38-3.00 |
| Executive | Claude Opus | Creative writing, complex reasoning, strategy, nuanced decisions | $15.00 |

The critical discovery was Kimi K2.5. We researched it on April 6 and found it was 4.1x cheaper than Sonnet ($0.60 vs $3.00 per million tokens) with competitive quality — 8.8 vs 9.1 on our benchmark suite. It scores 76.8% on SWE-Bench for coding tasks.

The tradeoff: Kimi is 5x slower and weaker on hallucination-prone tasks. So we use it for analytics, QA pipelines, and research observation — tasks where speed doesn’t matter and outputs are verifiable. Sonnet handles creative writing, tutoring, and resume work where nuance matters.

The model router (scripts/model-router.py) follows a simple pipeline:

1. Analyze task text for keywords and complexity signals
2. Classify into tier (worker / specialist / executive)
3. Check daily budget against $15 target
4. If >80% budget consumed → auto-downgrade one tier
5. Route to cheapest available model in tier

The auto-downgrade at 80% of budget is the safety valve. Once we've burned more than $12 (80% of the $15 target) by, say, 3 PM, even specialist tasks get routed to Haiku or Ollama. This prevents budget blowouts from unexpected heavy usage.
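A minimal sketch of the five-step pipeline, not the actual contents of scripts/model-router.py. The keyword lists, model labels, and per-model prices below are assumptions chosen for illustration; only the three tiers, the $15 target, and the 80% threshold come from the description above.

```python
# Tiers ordered from cheapest to most capable.
TIERS = ["worker", "specialist", "executive"]

# Assumed per-1M-token prices (illustrative, loosely following the tier table).
MODEL_PRICES = {
    "worker": {"ollama/qwen2.5:7b": 0.00, "claude-haiku": 1.00},
    "specialist": {"kimi-k2.5": 0.60, "claude-sonnet": 3.00},
    "executive": {"claude-opus": 15.00},
}

# Hypothetical complexity signals; a real classifier would use richer rules.
KEYWORDS = {
    "executive": ("proposal", "strategy", "strategic"),
    "specialist": ("code", "analyze", "research", "debug"),
}

DAILY_BUDGET_USD = 15.00
DOWNGRADE_THRESHOLD = 0.80  # downgrade once more than 80% of the budget is spent


def classify(task: str) -> str:
    """Steps 1-2: scan the task text and pick the highest tier that matches."""
    text = task.lower()
    for tier in ("executive", "specialist"):
        if any(keyword in text for keyword in KEYWORDS[tier]):
            return tier
    return "worker"


def apply_budget(tier: str, spent_usd: float) -> str:
    """Steps 3-4: drop one tier when spend exceeds the threshold."""
    if spent_usd / DAILY_BUDGET_USD > DOWNGRADE_THRESHOLD:
        return TIERS[max(TIERS.index(tier) - 1, 0)]
    return tier


def route(task: str, spent_usd: float) -> str:
    """Step 5: choose the cheapest model in the (possibly downgraded) tier."""
    tier = apply_budget(classify(task), spent_usd)
    prices = MODEL_PRICES[tier]
    return min(prices, key=prices.get)


if __name__ == "__main__":
    print(route("Draft a consulting proposal", spent_usd=5.00))
    print(route("Draft a consulting proposal", spent_usd=13.00))
```

In this sketch the worker tier cannot be downgraded further, so worker tasks keep running on the cheapest worker model even when the budget is nearly exhausted.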

The Unified Router: Three Systems, One Brain

The model router handles individual LLM calls, but we also needed routing across three entire systems:

  1. Agent System tools — MCP servers, cron jobs, file operations
  2. OpenClaw agents — Specialized agents with their own orchestration
  3. Claude Code sessions — Direct coding and complex tasks

The unified router (agents/ai_os/unified_router.py) uses weighted keyword scoring with a multi-word bonus. It’s health-aware: if OpenClaw’s gateway is down, it automatically demotes OC agents and routes to Agent System equivalents. A 30-second health cache prevents hammering the gateway.
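To make the scoring and health check concrete, here is a rough sketch under stated assumptions. The keyword weights, the size of the multi-word bonus, the system labels, and the `HealthCache` class are all hypothetical; only the ideas (weighted keywords, a bonus for multi-word phrases, a 30-second health cache, and falling back from OpenClaw to the Agent System) come from the description above.

```python
import time
from typing import Callable, Dict

# Hypothetical keyword weights per target system.
KEYWORD_WEIGHTS: Dict[str, Dict[str, float]] = {
    "agent_system": {"cron": 2.0, "file": 1.0, "mcp server": 2.0},
    "openclaw": {"research": 2.0, "slide deck": 2.0, "qa": 1.5},
    "claude_code": {"refactor": 2.0, "debug": 1.5, "pull request": 2.0},
}
MULTI_WORD_BONUS = 1.0  # assumed bonus for matching a multi-word phrase


class HealthCache:
    """Caches a health probe result so the gateway is hit at most once per TTL."""

    def __init__(self, probe: Callable[[], bool], ttl: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self._probe = probe
        self._ttl = ttl
        self._clock = clock
        self._checked_at = None
        self._healthy = False

    def is_healthy(self) -> bool:
        now = self._clock()
        if self._checked_at is None or now - self._checked_at >= self._ttl:
            self._healthy = self._probe()
            self._checked_at = now
        return self._healthy


def score(task: str) -> Dict[str, float]:
    """Sum the weights of matched phrases (simple substring match) per system."""
    text = task.lower()
    scores = {}
    for system, weights in KEYWORD_WEIGHTS.items():
        total = 0.0
        for phrase, weight in weights.items():
            if phrase in text:
                total += weight + (MULTI_WORD_BONUS if " " in phrase else 0.0)
        scores[system] = total
    return scores


def route(task: str, openclaw_health: HealthCache) -> str:
    """Pick the top-scoring system; fall back to the Agent System if OpenClaw is down."""
    scores = score(task)
    best = max(scores, key=scores.get)  # ties go to the first system listed
    if best == "openclaw" and not openclaw_health.is_healthy():
        return "agent_system"
    return best
```

Passing the probe and clock in as callables keeps the cache testable without a live gateway; in practice the probe would be whatever check the gateway exposes.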

Seven OpenClaw agents were retired and mapped to AS equivalents:

  • research-master and researcher → AS research agent
  • slide-architect → AS Deck Architect
  • qa-engine and qa-pro → AS QA agent
  • quanta and coding → AS specialists

We built a 50-case test suite to validate routing: the router passes all 50 cases (100% accuracy) with under 15 ms of routing latency.

Routing is pointless without visibility into spend. Our budget system:

  • State file: state/budget.json — running daily total
  • Per-request logging: every routed call logs estimated_cost_usd to routing.log
  • Nightly summary: track-costs.sh runs at 11 PM and sends a Telegram message with the day’s breakdown
  • Weekly reports: cost_tracker.py generates weekly cost reports with per-agent breakdowns
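A small sketch of how the state file and the per-request log could fit together. The function name `record_call` and every field except `estimated_cost_usd` are assumptions; the real layout of state/budget.json and routing.log is not shown in this post.

```python
import json
from datetime import date, datetime, timezone
from pathlib import Path
from typing import Optional


def record_call(state_path: Path, log_path: Path, model: str, agent: str,
                estimated_cost_usd: float, today: Optional[str] = None) -> float:
    """Append one JSONL log entry and return the updated running daily total."""
    today = today or date.today().isoformat()

    # Load the running total, starting from zero when the day rolls over.
    state = {"date": today, "total_usd": 0.0}
    if state_path.exists():
        loaded = json.loads(state_path.read_text())
        if loaded.get("date") == today:
            state = loaded

    state["total_usd"] = round(state["total_usd"] + estimated_cost_usd, 6)
    state_path.parent.mkdir(parents=True, exist_ok=True)
    state_path.write_text(json.dumps(state))

    # One JSON object per line, so the log can be analyzed later.
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model,   # assumed field name
        "agent": agent,   # assumed field name
        "estimated_cost_usd": estimated_cost_usd,
    }
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")

    return state["total_usd"]
```

The returned total is what a router would compare against the daily target before deciding whether to downgrade.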

The Telegram summary looks like:

Daily Cost Report
─────────────────
Opus: $3.20 (2 calls)
Sonnet: $4.80 (47 calls)
Kimi: $1.10 (23 calls)
Haiku: $0.40 (89 calls)
Ollama: $0.00 (34 calls)
─────────────────
Total: $9.50 / $15.00
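A report in this shape can be produced by aggregating the JSONL routing log. The sketch below is illustrative rather than a copy of track-costs.sh: it assumes each log line carries a `model` field alongside `estimated_cost_usd`, and it sorts rows by cost, which may differ from the ordering shown above.

```python
import json
from collections import defaultdict
from typing import Iterable

DAILY_BUDGET_USD = 15.00


def build_report(log_lines: Iterable[str], budget: float = DAILY_BUDGET_USD) -> str:
    """Aggregate per-model cost and call counts into a plain-text summary."""
    cost = defaultdict(float)
    calls = defaultdict(int)
    for line in log_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log
        entry = json.loads(line)
        cost[entry["model"]] += entry["estimated_cost_usd"]
        calls[entry["model"]] += 1

    rule = "─" * 17
    rows = [
        f"{model}: ${cost[model]:.2f} ({calls[model]} calls)"
        for model in sorted(cost, key=cost.get, reverse=True)  # most expensive first
    ]
    total = sum(cost.values())
    return "\n".join(
        ["Daily Cost Report", rule, *rows, rule, f"Total: ${total:.2f} / ${budget:.2f}"]
    )
```

The resulting string can be sent as the body of the nightly Telegram message.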

Running Ollama locally on the Mac Mini gives us a genuinely free tier. We use:

  • qwen2.5:7b — general-purpose worker tasks
  • nomic-embed-text — embeddings for semantic memory (Mem0) and RAG (Canvas LMS content)

These models run on Apple Silicon with zero API cost. The tradeoff is they’re slower and less capable than cloud models, but for embedding generation and simple formatting tasks, they’re more than sufficient.
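For reference, generating an embedding against a local Ollama instance can look roughly like this. The sketch assumes Ollama is listening on its default port (11434) and uses its `/api/embeddings` endpoint with a `model` and `prompt` payload; the helper names are hypothetical and this is not necessarily how Mem0 or the RAG pipeline calls it.

```python
import json
import urllib.request
from typing import List

# Assumed: Ollama running locally on its default port.
OLLAMA_EMBEDDINGS_URL = "http://localhost:11434/api/embeddings"


def build_request(text: str, model: str = "nomic-embed-text") -> urllib.request.Request:
    """Build the POST request for a single embedding."""
    payload = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_EMBEDDINGS_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(text: str, model: str = "nomic-embed-text") -> List[float]:
    """Send the request and return the embedding vector from the response."""
    with urllib.request.urlopen(build_request(text, model), timeout=30) as resp:
        return json.loads(resp.read())["embedding"]


if __name__ == "__main__":
    # Requires a running Ollama server with the model already pulled.
    vector = embed("Week 3 assignment: linear regression problem set")
    print(len(vector))
```

Because the call never leaves the machine, the only cost is local compute time.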

1. Route by task, not by agent. Early on, we tried assigning models to agents (e.g., “the CRM agent always uses Haiku”). This was too rigid. The same agent might need Haiku for a simple lookup and Sonnet for a complex analysis. Route each individual task.

2. Auto-downgrade beats hard limits. We experimented with hard budget caps that stopped all operations once the limit was hit. Halting everything turned out to be worse than running at degraded quality. Auto-downgrading keeps everything running, just at lower quality.

3. Benchmark your actual tasks. Generic benchmarks are misleading. Kimi K2.5 looks great on coding benchmarks but hallucinates on open-ended creative tasks. Test with YOUR workload.

4. Log everything. The JSONL routing log lets us retroactively analyze which models handled which tasks and whether quality suffered during downgrades. This data drives continuous optimization.

5. Free tiers compound. Ollama handles 15-20% of our daily requests at zero cost. Over a month, that’s $30-50 saved — enough to fund the occasional Opus splurge.


For the full technical details, see the Model Router Architecture reference.