The Deep-Research Agent

Tier 3 · What we built · 9 min read

Before this, read:

Tools 101 — tool calls are the mechanism the research loop runs on
Anatomy of an agent — the package structure this agent follows

The research agent is one of the oldest pieces of this system. It started as the heart of OpenClaw — the “iterative knowledge-gap loop with 5 subagents” described in the original spec — and survived the pivot to Claude Code intact. Today it lives at agents/researcher/ and runs as the backbone for everything that needs more than a single search to answer: the merch store’s initial market research, the competitive landscape analysis for the cockpit, the growth strategy, the health expert’s KB, and more.

The loop structure

The core loop (agents/researcher/core/loop.py) runs a configurable number of iterations. Each iteration:

Knowledge-gap agent — given what we know so far, what do we still not know? Returns a list of targeted sub-questions.
Tool-selector agent — for each gap, which source is most likely to answer it? Chooses from: brave_search, github_search, reddit_search, arxiv_search, hn_search, firecrawl_extract, semantic_scholar_search, stackoverflow_search, wikipedia_search, twitter_search.
Parallel tool execution — runs selected tools against the gap questions, collects results.
Observations agent — what did the sources actually say? Builds structured findings.
Devil’s advocate — what’s weak, contradicted, or missing in the findings so far?
Critique-gap agent — updates the knowledge gap list based on what the devil’s advocate found.

After all iterations, a writer agent synthesizes the findings into a cited report and saves it to Obsidian at ~/obsidian-vault/11-Agents/research/.
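
The iteration steps above can be sketched roughly as follows. This is a minimal illustration, not the code in agents/researcher/core/loop.py: ResearchState, every function name, and the stubbed agent bodies are hypothetical stand-ins for the real LLM-backed agents and tools.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class ResearchState:
    """Hypothetical container for what the loop has learned so far."""
    query: str
    gaps: list[str] = field(default_factory=list)
    findings: list[str] = field(default_factory=list)


# Each stub below stands in for an LLM-backed agent or a real tool call.
def find_gaps(state):
    return state.gaps or [f"What is known about {state.query}?"]

def select_tool(gap):
    return "brave_search"

def run_tool(tool, gap):
    return f"[{tool}] result for: {gap}"

def observe(results):
    return [f"finding from {r}" for r in results]

def devils_advocate(findings):
    return ["needs a second independent source"] if len(findings) < 3 else []

def critique_gaps(objections):
    return [f"Resolve: {o}" for o in objections]


def research_loop(query, iterations):
    state = ResearchState(query)
    for _ in range(iterations):
        gaps = find_gaps(state)                        # 1. knowledge gaps
        plan = [(select_tool(g), g) for g in gaps]     # 2. tool selection
        with ThreadPoolExecutor() as pool:             # 3. parallel execution
            results = list(pool.map(lambda p: run_tool(*p), plan))
        state.findings += observe(results)             # 4. observations
        objections = devils_advocate(state.findings)   # 5. devil's advocate
        state.gaps = critique_gaps(objections)         # 6. updated gap list
    return state
```

The point of the shape is that the gap list produced at the end of one iteration is the input to the next, so each pass is steered by what the previous critique found lacking.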

The loop also tracks a confidence level, a citation graph, multi-hop reasoning across sources, and cross-source triangulation. Cost is logged per run via a CostTracker that records which models and tool calls were used.
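
As a rough picture of per-run cost logging, the sketch below keeps a running total per model or tool. RunCostLog and its methods are hypothetical and are not the real CostTracker interface, which this article does not show.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class RunCostLog:
    """Hypothetical per-run ledger keyed by model or tool name."""
    by_source: dict = field(default_factory=lambda: defaultdict(float))

    def record(self, source: str, usd: float) -> None:
        # Accumulate spend under a label such as "model:writer".
        self.by_source[source] += usd

    def total(self) -> float:
        return sum(self.by_source.values())


log = RunCostLog()
log.record("model:writer", 0.5)
log.record("tool:brave_search", 0.25)
log.record("model:writer", 0.25)
```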

Depth tiers

Three named tiers configure the loop:

shallow: 2–3 iterations, limited tool budget, fast (~$0.50–$1)
standard: 5–6 iterations, broader tool coverage (~$2–$3)
deep: 8–12 iterations, full tool suite, cross-source triangulation, multi-hop reasoning (~$5–$10)
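
One way to picture the tiers is as a small lookup table. The sketch below is an assumption about shape only: DepthTier, TIERS, and resolve_tier are hypothetical names, and the triangulate flag for the standard tier is a guess, since the list above only attributes triangulation to deep.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DepthTier:
    """Hypothetical tier record; iteration ranges are the ones quoted above."""
    name: str
    min_iterations: int
    max_iterations: int
    triangulate: bool  # cross-source triangulation and multi-hop reasoning


TIERS = {
    "shallow": DepthTier("shallow", 2, 3, triangulate=False),
    "standard": DepthTier("standard", 5, 6, triangulate=False),  # assumed
    "deep": DepthTier("deep", 8, 12, triangulate=True),
}


def resolve_tier(depth: str) -> DepthTier:
    """Look up a tier by name, failing loudly on an unknown depth value."""
    try:
        return TIERS[depth]
    except KeyError:
        raise ValueError(
            f"unknown depth {depth!r}; expected one of {sorted(TIERS)}"
        ) from None
```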

The flagship research run on 2026-06-08 (competitive landscape for the cockpit) ran at the deep tier: 370 sources, $7.78. The AI coding agent market research run the same day ran for 8 iterations and produced the ChatbotToNerveCenter deck.

Run from the CLI:

python3.12 -m agents.researcher.main "what are X/Reddit saying about AI coding agents June 2026" --depth deep

The X and Reddit backends

The research agent’s social coverage expanded significantly on 2026-06-08 when both X and Reddit backends went live.

Reddit: the original reddit_tool.py used a dead API endpoint (PullPush, archived ~May 2025) and a broken sort. The rewrite switched to Reddit’s OAuth API with sort=relevance (the previous sort=new was ignoring the query parameter entirely). The 90-day recency window is configurable via config.social_lookback_days. Source: commit 5acb55d.

X/Twitter: tools/twitter.py uses the twitterapi.io backend (approximately $0.15/1k results). It returns real same-day posts with @handle, date, and likes. The official X API is write-only for this tier; twitterapi.io provides the read path. Source: commit after 5acb55d, same day.
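
The recency window can be pictured as a simple cutoff filter. This is a sketch under assumptions: posts are taken to be dicts with a timezone-aware created_at field, and within_lookback is a hypothetical helper. Only the 90-day default and the config.social_lookback_days setting come from the text above.

```python
from datetime import datetime, timedelta, timezone


def within_lookback(posts, lookback_days=90, now=None):
    """Keep posts whose 'created_at' falls inside the trailing window."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=lookback_days)
    return [p for p in posts if p["created_at"] >= cutoff]


# Usage with a fixed "now" so the result does not depend on the wall clock.
now = datetime(2026, 6, 8, tzinfo=timezone.utc)
posts = [
    {"id": "recent", "created_at": now - timedelta(days=10)},
    {"id": "stale", "created_at": now - timedelta(days=120)},
]
kept = within_lookback(posts, lookback_days=90, now=now)
```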

Both backends return results within a trailing 90-day window, sorted newest-first. The tool-selector agent picks between X, Reddit, Brave, arXiv, and the others based on the gap question type — social signal questions go to X/Reddit; academic claims go to arXiv/Semantic Scholar; code questions go to GitHub/StackOverflow.
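
The routing described above amounts to a mapping from question type to tools. The real selector is an agent, so the keyword matching below is purely illustrative: ROUTES, CUES, classify_gap, and select_tools are hypothetical names, and only the tool identifiers come from the list earlier in this article.

```python
# Hypothetical question categories mapped to the tool names listed earlier.
ROUTES = {
    "social": ["twitter_search", "reddit_search"],
    "academic": ["arxiv_search", "semantic_scholar_search"],
    "code": ["github_search", "stackoverflow_search"],
}
DEFAULT_TOOLS = ["brave_search"]

# Illustrative keyword cues only; the real selector is an LLM agent.
CUES = {
    "social": ("sentiment", "saying", "reaction", "community"),
    "academic": ("paper", "study", "benchmark", "evidence"),
    "code": ("library", "repo", "error", "implementation"),
}


def classify_gap(question: str) -> str:
    """Return the first category whose cue appears in the question."""
    text = question.lower()
    for category, cues in CUES.items():
        if any(cue in text for cue in cues):
            return category
    return "general"


def select_tools(question: str) -> list[str]:
    """Route a gap question to tools, falling back to general web search."""
    return ROUTES.get(classify_gap(question), DEFAULT_TOOLS)
```

Falling back to brave_search for anything unclassified mirrors the behaviour described later, where factual questions avoid the paid backends.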

Checkpointing and resumption

Long runs (deep tier, 10+ iterations) can exceed a session’s reliable execution window. The loop checkpoints state after each iteration via lib.checkpoint.CheckpointManager. If a run is interrupted (network failure, session rotation, process kill), it resumes from the last completed iteration on the next invocation with the same task_id, which is derived deterministically from a hash of the query and depth.

The task ID is:

import hashlib

def _research_task_id(query: str, depth: str) -> str:
    # First 16 hex chars of SHA-256 over "query::depth": stable across runs.
    h = hashlib.sha256(f"{query}::{depth}".encode()).hexdigest()[:16]
    return f"research-{h}"

Same query + same depth → same ID → resumes from the last checkpoint. A new query, or the same query at a different depth, hashes to a different ID and starts fresh.
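
To make the resume behaviour concrete, here is a minimal file-based sketch. It does not reproduce the lib.checkpoint.CheckpointManager interface, which is not shown in this article: FileCheckpoint, run, and the fail_at parameter are hypothetical, and the state layout is an assumption.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def research_task_id(query: str, depth: str) -> str:
    h = hashlib.sha256(f"{query}::{depth}".encode()).hexdigest()[:16]
    return f"research-{h}"


class FileCheckpoint:
    """Hypothetical stand-in that stores one JSON file per task ID."""

    def __init__(self, root: Path):
        self.root = root

    def _path(self, task_id: str) -> Path:
        return self.root / f"{task_id}.json"

    def load(self, task_id: str) -> dict:
        path = self._path(task_id)
        if path.exists():
            return json.loads(path.read_text())
        return {"completed": 0, "findings": []}

    def save(self, task_id: str, state: dict) -> None:
        tmp = self._path(task_id).with_suffix(".tmp")
        tmp.write_text(json.dumps(state))
        tmp.replace(self._path(task_id))  # rename so readers never see a partial file


def run(query, depth, iterations, store, fail_at=None):
    task_id = research_task_id(query, depth)
    state = store.load(task_id)
    for i in range(state["completed"], iterations):
        if i == fail_at:
            raise RuntimeError("simulated interruption")
        state["findings"].append(f"iteration {i + 1}")
        state["completed"] = i + 1
        store.save(task_id, state)  # checkpoint after every iteration
    return state


store = FileCheckpoint(Path(tempfile.mkdtemp()))
try:
    run("example query", "deep", iterations=4, store=store, fail_at=2)
except RuntimeError:
    pass  # two iterations were checkpointed before the interruption
resumed = run("example query", "deep", iterations=4, store=store)
```

The second call loads the saved state and starts at the third iteration, so no completed work is repeated.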

What it’s been used for

Real runs that produced real artifacts:

Merch store initial research (2026-06-08): Printful vs Printify stack comparison, IP-safety legal landscape, POD economics. Produced BUILD-PLAN.md.
Cockpit competitive landscape (2026-06-08, 370 sources, $7.78): NC5 competitive position vs the field. Report at ~/clawd/projects/cockpit-chat-v3/COMPETITIVE-LANDSCAPE-2026-06-08.md.
AI coding agent market research (2026-06-08, 8 iterations): X/Reddit sentiment on Claude Code, Cursor, and GitHub Copilot. Became the source deck for the ChatbotToNerveCenter presentation.
Health expert KB (2026-06-06): retatrutide, peptide stacking, GLP-1 muscle retention, extended fasting. ~180KB of cited KB files at ~/clawd/domains/health/state/expert-kb/.
X growth strategy (2026-05): 5 parallel deep-research streams, produced STRATEGY.md.
Claude Code on your phone (2026-05-27): 307 sources, $4.75, confidence 51/100 (structurally solid, specifics thin; the agent said so).

The last example is worth noting: the researcher reports its own confidence score. A run that came back at 51/100 is the agent flagging that the answer is structurally sound but thin on specifics — that’s honest output, not a failed run.

Cost and caching

Results are cached for 1 hour via ResearchCache backed by Supabase (scrape_cache table). A second run on the same query within the cache window skips the tool calls and re-synthesizes from the cached results, costing only the writer pass.
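
The cache-then-synthesize behaviour can be sketched with an in-memory time-to-live cache. This is an illustration only: the real ResearchCache is backed by Supabase, and TTLCache, gather_results, and the injectable clock are hypothetical.

```python
import time


class TTLCache:
    """In-memory stand-in for a cache with a fixed time-to-live."""

    def __init__(self, ttl_seconds=3600, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = {}

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self.clock() - stored_at >= self.ttl:
            del self._entries[key]  # expired: drop it and report a miss
            return None
        return value

    def put(self, key, value):
        self._entries[key] = (self.clock(), value)


def gather_results(query, cache, fetch):
    """Return (results, was_cached); fetch only when the cache misses."""
    cached = cache.get(query)
    if cached is not None:
        return cached, True
    results = fetch(query)
    cache.put(query, results)
    return results, False
```

On a hit, the caller can skip straight to the writer pass, which is why a repeat run inside the window costs only the synthesis.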

LLM calls dominate per-run cost. The typical breakdown for a deep run: ~60% LLM (the writer and devil’s advocate passes are the expensive ones), ~25% tool calls (mostly twitterapi.io on queries with a social-signal component), ~15% overhead. Budget: shallow ~$0.50–$1, standard ~$2–3, deep ~$5–10.
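
As a worked example, applying the approximate 60/25/15 split to the $7.78 deep run mentioned earlier gives roughly $4.67 for LLM calls, $1.95 for tool calls, and $1.17 for overhead. These are estimates derived from the quoted percentages, not logged figures, and estimate_breakdown is a hypothetical helper.

```python
from decimal import Decimal

# Approximate shares quoted above for a deep run.
SHARES = {
    "llm": Decimal("0.60"),
    "tools": Decimal("0.25"),
    "overhead": Decimal("0.15"),
}


def estimate_breakdown(total_usd: str) -> dict[str, Decimal]:
    """Split a run total by the approximate shares (unrounded)."""
    total = Decimal(total_usd)
    return {name: total * share for name, share in SHARES.items()}


breakdown = estimate_breakdown("7.78")
```

Decimal is used so the three parts sum back to the total exactly; rounding each part to cents independently can drift by a cent.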

For queries where Brave alone is sufficient (most factual questions with no social-signal component), the tool-selector routes away from paid backends and the cost stays near the low end.

Next: What happens when the research output needs to become a McKinsey-style deck — Deck Architect v4.