The Ollama Smart Router

Tier 3 · Real Build · 6 min read

The Ollama smart router (agents/shared/ollama_client.py, registry slug ollama-smart-router-v1) is the component that decides whether an LLM call is served by a local Ollama instance or by a cloud provider. The philosophy driving it is simple: spend the least money that still produces an acceptable answer for the specific task.

The routing hierarchy

Task arrives
  → Try MacBook Pro Ollama (192.168.1.175)
  → If unavailable: try Mac Mini Ollama (local)
  → If unavailable: fall back to cloud (OpenRouter → Anthropic direct)

This hierarchy was inherited from OpenClaw, where the same tiering logic reportedly cut LLM costs by ~75–80% compared with always calling a premium model. The key assumption: not every task needs Opus. Log summarization, domain classification, short formatting tasks, and yes/no judgments are cheap-tier work. Sending them to Opus is like dispatching a surgeon to change a lightbulb.

The MBP is listed first because in JD’s setup it is the primary Ollama host (the Mac Mini runs the always-on agent system). When the MBP is asleep or off the local network, the router falls through to the Mac Mini or the cloud without surfacing the failure to the caller.
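The fall-through behaviour can be sketched as an ordered list of backends tried in turn. This is a minimal illustration, not the router's actual code: route, the Backend pair, and the three stub handlers are hypothetical names, and it assumes an unavailable host shows up as a ConnectionError or TimeoutError.

```python
from typing import Callable, Optional, Sequence, Tuple

# Each backend is a (name, handler) pair. A handler returns text, or raises
# ConnectionError/TimeoutError when its host is unreachable (an assumption).
Backend = Tuple[str, Callable[[str], str]]


def route(prompt: str, backends: Sequence[Backend]) -> Tuple[str, str]:
    """Try each backend in order and return (backend_name, response)."""
    last_error: Optional[Exception] = None
    for name, handler in backends:
        try:
            return name, handler(prompt)
        except (ConnectionError, TimeoutError) as exc:
            # Availability failures are swallowed so the caller never sees them.
            last_error = exc
    raise RuntimeError("no backend available") from last_error


# Stub handlers standing in for the three tiers of the hierarchy.
def mbp_asleep(prompt: str) -> str:
    raise ConnectionError("MacBook Pro unreachable")


def mac_mini(prompt: str) -> str:
    return f"mini:{prompt}"


def cloud(prompt: str) -> str:
    return f"cloud:{prompt}"


chain = [("mbp", mbp_asleep), ("mac-mini", mac_mini), ("cloud", cloud)]
used, answer = route("ping", chain)  # the MBP stub fails, so the Mac Mini answers
```

Only availability errors trigger the fall-through in this sketch; any other exception from a handler would still propagate to the caller.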

What `ollama_client.py` provides

The shared client exposes call_ollama(). Any agent in the system can import it:

from agents.shared.ollama_client import call_ollama

response = call_ollama(
    prompt="Classify this log line as error/warning/info: ...",
    model="llama3.2",  # or whichever local model is pulled
    fallback="haiku"   # cloud tier to use if local unavailable
)

The auto-routing and fallback happen inside the call. The caller doesn’t need to know whether local or cloud handled it.
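For readers curious about the wire level, a local call can be sketched against Ollama's HTTP API (POST /api/generate on the default port 11434). This is not the contents of ollama_client.py: build_generate_request, send, and call_with_fallback are hypothetical names, and the cloud handler is an injected placeholder.

```python
import json
import urllib.request
from typing import Callable


def build_generate_request(host: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST to Ollama's /api/generate endpoint."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    return urllib.request.Request(
        f"http://{host}:11434/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 5.0) -> str:
    """Send the request and return the generated text from the JSON reply."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


def call_with_fallback(prompt: str, model: str, host: str,
                       cloud: Callable[[str], str],
                       sender: Callable[[urllib.request.Request], str] = send) -> str:
    """Try the local host; on a network error, hand the prompt to `cloud`."""
    try:
        return sender(build_generate_request(host, model, prompt))
    except OSError:  # urllib.error.URLError and socket timeouts subclass OSError
        return cloud(prompt)
```

Injecting the sender keeps the sketch testable without a running Ollama instance; a real client would also need to handle a missing model and malformed replies.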

The cost-tier approach

The agent system uses four tiers:

Tier	Example tasks	Typical model
Local (free)	Log classification, short formatting, yes/no gates	Ollama local
Cheap cloud	Drafts, domain routing, summarization	Haiku / Sonnet
Mid-tier	Longer synthesis, report generation, code review	Sonnet
Opus	Strategic decisions, high-stakes drafts, complex reasoning	Opus

This split is codified in the charter (see The Organization Charter): worker-level tasks go to Haiku or local; orchestrator-level decisions go to Sonnet; CEO-level work goes to Opus. Routing is not a runtime optimization pass; it is a design decision made when an agent is built.
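One way to picture "a design decision made when an agent is built" is a static mapping resolved once at construction. The sketch below is illustrative only: Role, ROLE_TO_TIER, and Agent are hypothetical names, and the tier strings are labels rather than real model identifiers.

```python
from enum import Enum


class Role(Enum):
    WORKER = "worker"
    ORCHESTRATOR = "orchestrator"
    CEO = "ceo"


# Tier labels mirroring the worker / orchestrator / CEO split described above.
ROLE_TO_TIER = {
    Role.WORKER: "haiku-or-local",
    Role.ORCHESTRATOR: "sonnet",
    Role.CEO: "opus",
}


class Agent:
    def __init__(self, name: str, role: Role) -> None:
        self.name = name
        self.role = role
        # The tier is fixed here, at build time, and never re-evaluated per call.
        self.tier = ROLE_TO_TIER[role]


summarizer = Agent("log-summarizer", Role.WORKER)
```

Because the tier is an attribute of the agent rather than of the request, changing what a worker costs means editing one table instead of auditing every call site.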

The real token-spend tracking that landed in May 2026 (CHANGELOG 2026-05-31) made LLM spend visible for the first time: ~$186/day, ~$2k/week. That number is what actually drove the discipline around tier routing, because it is hard to optimize costs you cannot see.

OpenRouter → Anthropic direct fallback

A related routing piece: many agents originally had hand-rolled OpenRouter calls. When OpenRouter credits ran out, those calls would fail silently. The evolution engine’s wave 2 (CHANGELOG 2026-06-08) migrated 9 files to the shared llm_client, which includes an Anthropic-direct fallback:

OpenRouter → 402 (out of credits) → Anthropic direct API
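The chain above can be sketched as a wrapper that retries against the second provider only when the first reports 402. This is an assumption-laden illustration, not the shared llm_client: ProviderError and complete are hypothetical names, and both providers are passed in as plain callables.

```python
from typing import Callable


class ProviderError(Exception):
    """Hypothetical error type carrying the HTTP status a provider returned."""

    def __init__(self, status: int, message: str = "") -> None:
        super().__init__(f"{status} {message}".strip())
        self.status = status


PAYMENT_REQUIRED = 402  # the out-of-credits status named in the text


def complete(prompt: str,
             openrouter: Callable[[str], str],
             anthropic_direct: Callable[[str], str]) -> str:
    """Use OpenRouter first; on 402 only, retry against Anthropic directly."""
    try:
        return openrouter(prompt)
    except ProviderError as exc:
        if exc.status != PAYMENT_REQUIRED:
            raise  # other failures stay loud instead of being masked
        return anthropic_direct(prompt)
```

Restricting the fallback to 402 is a design choice in this sketch: it addresses the silent-failure case without hiding unrelated errors such as bad requests.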

A CI gate (openrouter-literal) was added to catch any new hand-rolled OpenRouter calls at PR time. The migration itself was the fix; the gate prevents regression.
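A gate of this kind can be as simple as a text scan. The sketch below is a guess at the general shape, not the real openrouter-literal check: it assumes a hand-rolled call is recognisable by the OpenRouter hostname, and find_violations and its allowed list are hypothetical.

```python
import re
from pathlib import Path
from typing import List

# Assumption: a hand-rolled call can be spotted by the OpenRouter hostname.
PATTERN = re.compile(r"openrouter\.ai", re.IGNORECASE)


def find_violations(root: Path, allowed: List[str]) -> List[str]:
    """Return relative paths of .py files under root that mention OpenRouter
    and are not on the allowed list (for example, the shared client itself)."""
    hits = []
    for path in sorted(root.rglob("*.py")):
        rel = path.relative_to(root).as_posix()
        if rel in allowed:
            continue
        if PATTERN.search(path.read_text(encoding="utf-8", errors="ignore")):
            hits.append(rel)
    return hits
```

In CI, a wrapper script would exit non-zero whenever the returned list is non-empty, which fails the PR before the regression lands.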

What local inference is actually good for

Local Ollama inference on a MacBook Pro or Mac Mini is fast and free, but it has real limits. Models that fit comfortably in 8–16 GB of unified memory (7B–13B parameter models) run well. Larger models force the machine to swap, which makes them slower than a round trip to a cloud API.
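A back-of-the-envelope check shows why 7B–13B is the comfortable range. The sketch estimates weight memory as parameters times bits per weight; the 4-bit quantization used in the usage lines and the 0.75 headroom factor are assumptions for illustration, not measurements from this system.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    # billions of params * bits per weight / 8 bits per byte = gigabytes
    return params_billions * bits_per_weight / 8


def fits(params_billions: float, bits_per_weight: int, memory_gb: float,
         headroom: float = 0.75) -> bool:
    """True if the weights fit within a fraction of unified memory.

    The headroom factor is an assumed allowance for the OS, the KV cache,
    and other processes.
    """
    return weight_memory_gb(params_billions, bits_per_weight) <= memory_gb * headroom


seven_b = weight_memory_gb(7, 4)      # 3.5 GB of weights at 4-bit
thirteen_b = weight_memory_gb(13, 4)  # 6.5 GB of weights at 4-bit
```

Under these assumptions a 4-bit 7B model fits an 8 GB machine and a 4-bit 13B model fits a 16 GB one, while a 70B model at the same quantization needs about 35 GB for weights alone.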

The practical ceiling for local use in this system: classification, routing decisions, short summarizations, formatting, and yes/no gates. Anything requiring long-context reasoning, nuanced writing, or complex code benefits from cloud.

The router does not currently do dynamic capability routing — it does not say “this task requires X capability, route to Y tier.” That is a future improvement. Currently the tier is set by the caller: agents are built with a specific tier in mind, and the router handles availability/fallback within that tier.

The OpenClaw lineage

The cost-first LLM router in OpenClaw was one of the three pieces that worked well enough to carry forward (the other two were the research loop and Obsidian memory). The original OpenClaw spec claimed ~75–80% cost savings and ~$20–35/month in cloud spend. The current system spends significantly more because it runs far more agents, but the routing discipline still applies: the baseline cost without local routing and tier discipline would be higher.

The Ollama smart router is one of the least glamorous pieces in the system. It does not produce visible outputs, it is not mentioned in LinkedIn posts, and it does not win demos. It saves money on every single LLM call. That is the point.

Next: The thing agents use to browse the web. The Agent Browser covers Playwright, Steel.dev fallback, per-site adapters, and the authenticated-profile pattern.