The Expert Professor Agent (Canvas)

Tier 3 · Real Build · 8 min read

In April 2026, JD had 19 PowerPoint decks (519 MB) from his BYU MBA analytics course, an upcoming final exam, and one question: could an agent grounded in those slides answer exam questions better than a model guessing from general knowledge? The experiment produced something concrete: agents/professor, a retrieval-augmented agent that uses the actual course materials as its context. A triple-judge benchmark later confirmed that JD’s pipeline scored 30/30 on the exam’s analytical questions, while GPT-5 and Gemini 2.0 scored 16.8 and 15.8 out of 30 on the same questions without access to computed data.

What the agent is

The professor agent is a RAG pipeline tuned to a single course’s content. It is not a generic chatbot that happens to answer school questions. Every response is grounded in specific slides, with citations back to the exact deck and slide number.

The architecture has three parts:

Ingestion — canvas_crawler.py fetches assignment data from the Canvas LMS API. box_downloader.py pulls lecture decks and data files from BYU Box (the university cloud storage). A parser chain handles PPTX, PDF, Markdown, and plain text. Each chunk is embedded with Voyage AI and stored in Chroma.
Retrieval — at query time, the system fetches a synthesis file and JD’s goal file unconditionally (course context and current focus), then retrieves the top-K chunks most relevant to the question using keyword + semantic similarity. For a question about OLS regression, you get Regression Days 1–4 slides 21, 43, 44, 53, 62, 79, and 82 — specific slides, not a fuzzy “check the regression unit.”
Generation — retrieved chunks feed into Opus with a system prompt that tells it to reason in the instructor’s voice, cite by slide number, and flag when a question requires running code rather than pattern-matching from training data.
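The retrieval step above can be sketched in a few lines. This is a minimal illustration, not the agent's actual code: the Chunk type, the top_k function, the alpha blend weight, and the toy two-dimensional embeddings are all assumptions made for the example. The real pipeline stores Voyage AI embeddings in Chroma, and the article does not say how the keyword and semantic scores are combined.

```python
import math
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    deck: str               # e.g. "Regression Day 1"
    slide: int              # slide number used for the citation
    text: str
    embedding: list[float]  # computed offline by the embedding model


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def keyword_overlap(question: str, text: str) -> float:
    """Fraction of the question's terms that also appear in the chunk."""
    q_terms = set(re.findall(r"[a-z0-9]+", question.lower()))
    t_terms = set(re.findall(r"[a-z0-9]+", text.lower()))
    return len(q_terms & t_terms) / len(q_terms) if q_terms else 0.0


def top_k(question, query_embedding, chunks, k=3, alpha=0.5):
    """Rank chunks by a weighted blend of semantic and keyword scores.

    alpha is a hypothetical weight: 1.0 is purely semantic, 0.0 purely keyword.
    """
    scored = [
        (alpha * cosine(query_embedding, c.embedding)
         + (1 - alpha) * keyword_overlap(question, c.text), c)
        for c in chunks
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:k]]
```

Because each hit carries its deck and slide number, the generation step can cite slides directly instead of pointing at a whole unit.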

The full MBA-560 ingest — 19 PPTs, 90 vault notes, 773 chunks — completes in under 30 seconds on the paid Voyage tier (CHANGELOG 2026-04-21 14:21).

The trigger protocol

One of the non-obvious design decisions: the agent fires on natural phrasing, not a slash command. From ~/agent-system/CLAUDE.md:

Trigger phrases (case-insensitive, partial match is fine):
- "matt madden", "madden agent", "ask madden"
- "analytics class agent", "analytics prof"
- "MBA-560 agent", "MBA 560 professor"

When the Telegram session sees any of those phrases, it calls agents.professor.ask() directly. JD types “ask madden about backward variable selection” and gets a cited answer — he doesn’t have to remember a command.
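A trigger check along these lines is enough to route a message. It is a sketch under two assumptions: "partial match" is read as a substring match, and the helper name matches_trigger is hypothetical. How the real Telegram session performs the check is not shown in the source.

```python
# Phrases copied from the CLAUDE.md excerpt above, lowercased for matching.
TRIGGER_PHRASES = (
    "matt madden", "madden agent", "ask madden",
    "analytics class agent", "analytics prof",
    "mba-560 agent", "mba 560 professor",
)


def matches_trigger(message: str) -> bool:
    """Return True if the message contains any trigger phrase.

    Matching is case-insensitive and ignores repeated whitespace.
    """
    normalized = " ".join(message.lower().split())
    return any(phrase in normalized for phrase in TRIGGER_PHRASES)
```

A session handler would call matches_trigger on each incoming message and hand the text to the professor agent only when it returns True.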

This matters because agents that require explicit invocation get used less. The whole point of building a course-expert agent is that JD reaches for it automatically when studying, the same way he’d text a smart classmate.

The 30/30 benchmark

The analytics final exam had 19 questions across regression, logistic modeling, clustering, conjoint analysis, and random forests. The test:

JD’s pipeline answered all 19 questions, running actual computations on the course’s CSV and XLSX data files.
A “round 2 bake-off” ran GPT and Gemini on the same questions without data access: GPT scored 16.8/30, Gemini 15.8/30.
Three independent judges (Sonnet 4.5, GPT-5.4, and Opus 4.7) verified the pipeline’s answers against the question set. No disagreement, no flagged weaknesses.

The CHANGELOG entry on that benchmark (2026-04-18 22:08) is direct: “Models DON’T compute on embedded data — pattern-match only. JD pipeline 30/30 stands. 45pt gap = tool-use value, not model-intelligence gap.”

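That is the core lesson. On questions that require computation, a smarter model alone does not close the gap that tool use opens. The pipeline ran pandas, statsmodels, and sklearn against the actual data files; the competing models read the question and guessed from training patterns. The “45pt” figure appears to be a percentage-point gap: 16.8/30 is 56% and 15.8/30 is about 53%, against the pipeline’s 100%.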
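To make "compute rather than pattern-match" concrete, here is a minimal sketch of the kind of step a tool-using pipeline performs: fitting a simple OLS line from data instead of estimating coefficients from the wording of the question. The data values are toy numbers invented for the example, not course data, and the closed-form fit uses only the standard library where the real pipeline used statsmodels.

```python
def fit_simple_ols(x: list[float], y: list[float]) -> tuple[float, float]:
    """Closed-form simple linear regression: returns (slope, intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy data (hypothetical): ad spend vs. sales.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.1, 4.9, 7.2, 8.8, 11.0]

slope, intercept = fit_simple_ols(ad_spend, sales)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

A model without tool access has no way to produce these numbers except by guessing; a pipeline that runs the fit gets them from the data.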

What the Madden audit added

After the initial build, the professor agent ran a self-audit against the actual course content. The audit (saved at ~/clawd/projects/professor-agent/madden-audit-2026-04-21.md) found the stack was strong on research and presentation infrastructure but thin on one specific workflow: the OLS diagnostic sequence.

The instructor’s enforced order is: diagnose → refine → validate → escalate. Skipping straight to Ridge/LASSO/Random Forest without first checking VIFs, residuals, and backward selection is the most common student error.

The audit produced REMEDY_ORDER.md at agents/analytics_suite/REMEDY_ORDER.md, which the solver is now required to follow — it will refuse to escalate to regularized methods until the diagnostic steps are complete. Five skills were also published to ~/.claude/skills/: regression-diagnostics, logit-classifier, survey-analyzer, data-prep, and tool-selector, each with Madden-voice checklists and anti-patterns.

That’s a real example of a RAG agent improving itself by grounding itself in the course content rather than defaulting to general ML conventions.

How it’s invoked

# From the terminal
python3.12 -m agents.professor --course MBA-560 "<question>"

# With a file attachment
python3.12 -m agents.professor --course MBA-560 \
  --file /path/to/exam.pdf "<question>"

# From another agent (Python)
from agents.professor.agent import ask

# user_question and attached_path are supplied by the calling agent
result = ask("MBA-560", user_question, files=[attached_path])
answer = result["answer"]   # citations already inline

Images go to Opus vision inputs directly. Attached documents (PDF, PPTX, DOCX, MD) are parsed to text and injected alongside the retrieved slide chunks. The retrieval query is seeded from the first 500 characters of any attached file, so hits steer toward the attached topic.
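The seeding step can be sketched as below. Only the 500-character limit comes from the description above. The function name build_retrieval_query and the choice to append each seed to the question on its own line are assumptions, since the source does not say how the seed and the question are combined.

```python
SEED_CHARS = 500  # from the article: first 500 characters of each attached file


def build_retrieval_query(question: str, attachment_texts: list[str]) -> str:
    """Combine the question with a short seed from each parsed attachment.

    Each seed is the first SEED_CHARS characters of the attachment's text,
    with runs of whitespace collapsed. Empty seeds are skipped.
    """
    seeds = [" ".join(text[:SEED_CHARS].split()) for text in attachment_texts]
    parts = [question.strip(), *[seed for seed in seeds if seed]]
    return "\n".join(parts)
```

Seeding this way means a vague question such as "help me with number 4" still retrieves slides on the attached exam's topic.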

What to replicate

Separate ingestion from generation

The Chroma store is built once, offline. Query-time latency is just retrieval + one Opus call. Don’t embed on every request.
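The split looks roughly like this in code. The sketch swaps a JSON file in for Chroma and takes the embedding function as a parameter, so it runs without any external service. The function names build_index and load_index and the record layout are hypothetical.

```python
import json
from pathlib import Path
from typing import Callable


def build_index(chunks: list[str],
                embed: Callable[[str], list[float]],
                index_path: str) -> int:
    """Offline step: embed every chunk once and persist the results."""
    records = [{"text": text, "embedding": embed(text)} for text in chunks]
    Path(index_path).write_text(json.dumps(records), encoding="utf-8")
    return len(records)


def load_index(index_path: str) -> list[dict]:
    """Query-time step: read stored embeddings; no chunk is re-embedded."""
    return json.loads(Path(index_path).read_text(encoding="utf-8"))
```

At query time only the question itself needs an embedding call; the stored chunk vectors are read straight from disk.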

Unconditional context anchors

Load the synthesis file and the goal file on every query — course context and current objectives should always be present, not retrieved conditionally.
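One way to picture the anchors: the context builder always emits the synthesis and goal text first, then appends whatever retrieval returned. The function name build_context, the section labels, and the citation format are assumptions made for this sketch, not the agent's actual prompt layout.

```python
def build_context(synthesis: str,
                  goals: str,
                  retrieved: list[tuple[str, int, str]]) -> str:
    """Assemble prompt context with two unconditional anchors.

    retrieved holds (deck, slide, text) tuples from the retrieval step.
    The anchors are included even when retrieval returns nothing.
    """
    sections = [
        "## Course synthesis\n" + synthesis.strip(),
        "## Current goals\n" + goals.strip(),
    ]
    for deck, slide, text in retrieved:
        sections.append(f"[{deck}, slide {slide}]\n{text.strip()}")
    return "\n\n".join(sections)
```

Keeping the anchors outside the retrieval path means a poorly matched query still gets answered with the course framing and the current study goal in view.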

Natural-language triggers

Wire the agent to fire on phrasing patterns in CLAUDE.md. Agents that need explicit invocation get used less.

Self-audit after the first real run

After the agent answered real questions, an audit found a content gap. Budget time for a post-launch critique pass.

Next: The Analytics Suite. It covers the solver, verifier, and playbook engine that produced those 30/30 answers on real exam data.