The Analytics Suite

Tier 3 · Real Build · 7 min read

The analytics suite (agents/analytics_suite) is a structured solver for MBA-level statistics and ML problems. It is not a general coding assistant. It knows seven topic areas, and for each one it enforces a diagnostic sequence before it will escalate to more complex methods.

Built in April 2026 and verified on a real final exam, it was the empirical test of a hypothesis: that the gap between a model that pattern-matches and a model that computes is larger than the gap in raw intelligence between any two models.

What shipped

Six milestones in one day (CHANGELOG 2026-04-18):

M1 — quanta smoke test (baseline integration check)
M2 — 5-page personal cheat sheet PDF built from 7 graded submissions, covering every exam topic with JD’s own worked examples
M3 — 7 playbook modules: SQL, OLS regression, random forest, logistic regression, K-means clustering, conjoint analysis, text analytics
M4 — problem parser with 8/8 classification accuracy (reads a question, identifies the method)
M5 — solver + verifier + marimo notebook export
M6 — full 7-topic end-to-end smoke test, all PASS 100%

CLI: python3.12 -m agents.analytics_suite solve|parse|topics|template

The enforced diagnostic sequence

The most important piece is not the solver — it is the constraint on what the solver is allowed to do. From agents/analytics_suite/REMEDY_ORDER.md:

1. Diagnose    — fit OLS, check R²/Adj R² gap, VIFs, F p, residuals
2. Refine      — backward variable selection (drop max-p iteratively)
3. Validate    — train/test split on the refined model only
4. Escalate    — Ridge/LASSO/RF ONLY if step 3 proves refined OLS
                  cannot meet the business requirement

The solver refuses to jump to regularized methods or tree-based models until it has completed the diagnostic steps. If a question names Ridge regression as the answer, the solver flags the order violation in its explanation.
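One way to picture the refusal is a small gate that tracks which steps have run. The sketch below is an illustration under assumptions, not the suite's actual code: `Step`, `RemedyGate`, and `OrderViolation` are hypothetical names, and the real solver may enforce the order differently.

```python
from enum import IntEnum


class Step(IntEnum):
    DIAGNOSE = 1
    REFINE = 2
    VALIDATE = 3
    ESCALATE = 4


class OrderViolation(Exception):
    """Raised when a step is requested out of sequence."""


class RemedyGate:
    """Hypothetical gate: allows each step only after the previous one."""

    def __init__(self):
        self.completed = []
        self.refined_ols_sufficient = None  # recorded by the Validate step

    def run(self, step, refined_ols_sufficient=None):
        if len(self.completed) == len(Step):
            raise OrderViolation("sequence already complete")
        expected = Step(len(self.completed) + 1)
        if step != expected:
            raise OrderViolation(f"expected {expected.name}, got {step.name}")
        if step == Step.VALIDATE:
            if refined_ols_sufficient is None:
                raise ValueError("Validate must report whether refined OLS suffices")
            self.refined_ols_sufficient = bool(refined_ols_sufficient)
        if step == Step.ESCALATE and self.refined_ols_sufficient:
            # Step 3 showed refined OLS is good enough, so escalation is refused.
            raise OrderViolation("refined OLS meets the requirement; no escalation")
        self.completed.append(step)
        return step


gate = RemedyGate()
try:
    gate.run(Step.ESCALATE)  # e.g. a question that names Ridge outright
except OrderViolation as err:
    print(err)
```

The point of the design is that escalation is unreachable by construction: there is no code path to step 4 that skips steps 1 through 3.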

This was not invented for the agent — it comes from the course instructor’s teaching sequence. The agent learned it by reading the actual course materials and codified it as a hard constraint. That is the right way to build a domain-specific tool: derive constraints from the domain, not from general ML defaults.

The benchmark in detail

The pipeline’s 30/30 result came from a three-stage process:

1. Extract the exam. The Canvas quiz submissions API was used to pull all 19 final-exam questions, including embedded images (a churn model output PNG for Q3) and data file references. The actual CSV and XLSX data files were downloaded from BYU Box.
2. Solve with computation. The solver ran pandas, statsmodels, and sklearn against the real data files. Conjoint questions (Q15, Q18, Q19) were verified via Playwright against the Qualtrics simulator: the agent opened a real browser, ran the simulation, and read the result.
3. Triple-judge verification. Three independent models (Sonnet 4.5, GPT-5.4, Opus 4.7) were each given only the question, the data, and the pipeline’s answers, with no canonical key. Zero disagreement across all 19 questions.
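The unanimity check in the third stage can be sketched as follows. This is a minimal illustration, assuming each judge returns a per-question boolean verdict; `find_disagreements`, the judge labels, and the sample verdicts are all hypothetical and do not reproduce the real run.

```python
def find_disagreements(question_ids, judge_verdicts):
    """Return the question ids that are not unanimously accepted.

    judge_verdicts maps a judge name to {question_id: bool}, where True
    means that judge accepted the pipeline's answer. A missing verdict
    counts as a disagreement so that gaps are never silently passed.
    """
    flagged = []
    for qid in question_ids:
        votes = [verdicts.get(qid) for verdicts in judge_verdicts.values()]
        if not all(vote is True for vote in votes):
            flagged.append(qid)
    return flagged


# Made-up verdicts for three placeholder judges on three questions.
verdicts = {
    "judge_a": {"Q1": True, "Q2": True, "Q3": True},
    "judge_b": {"Q1": True, "Q2": False, "Q3": True},
    "judge_c": {"Q1": True, "Q2": True},  # no verdict for Q3
}
print(find_disagreements(["Q1", "Q2", "Q3"], verdicts))
```

A result of zero disagreement corresponds to this function returning an empty list.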

The bake-off comparison (CHANGELOG 2026-04-18 22:08): GPT-5 scored 16.8/30, Gemini 2.0 scored 15.8/30 on the same questions without data access. JD’s pipeline scored 30/30.

The gap of 13 to 14 points out of 30 (roughly 45 percentage points) is not about which model is smarter. It is about whether the model can touch the data. A model that reads a question about employee satisfaction and guesses from training patterns will produce a plausible answer. A solver that loads the 3,025-row CSV, runs backward variable selection, drops 8 noise predictors, checks VIFs, and prints the coefficient table will produce the right answer.
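Backward variable selection, the "drop max-p iteratively" step, is a short loop once model fitting is factored out. The sketch below assumes a hypothetical `fit_pvalues` callable that refits the model on the given predictors and returns their p-values, and a 0.05 cutoff that is an assumption here rather than something the course materials are quoted as specifying.

```python
def backward_select(predictors, fit_pvalues, alpha=0.05):
    """Drop the predictor with the largest p-value until all are <= alpha.

    fit_pvalues is a hypothetical callable: given a list of predictor
    names, it refits the model and returns {name: p_value}. In practice
    it could wrap something like statsmodels' OLS(...).fit().pvalues.
    """
    kept = list(predictors)
    dropped = []
    while kept:
        pvalues = fit_pvalues(kept)  # refit after every removal
        worst = max(kept, key=lambda name: pvalues[name])
        if pvalues[worst] <= alpha:
            break  # every remaining predictor is significant
        kept.remove(worst)
        dropped.append(worst)
    return kept, dropped


# Stand-in fit with fixed, made-up p-values (a real fit would change them
# on each refit, which is why the loop refits every time).
fake_p = {"tenure": 0.001, "salary": 0.01, "noise_a": 0.8, "noise_b": 0.4}
kept, dropped = backward_select(
    list(fake_p), lambda names: {name: fake_p[name] for name in names}
)
print(kept, dropped)
```

Only one predictor is removed per iteration because p-values shift when a correlated predictor leaves the model.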

What the seven playbooks cover

Module	What it solves
`sql`	Query construction, aggregation, window functions
`regression`	OLS with the enforced Madden remedy order
`random_forest`	Feature importance, threshold decisions, precision/recall trade-offs
`logistic`	Logit, threshold optimization, confusion matrix reads
`kmeans`	Elbow method, silhouette scoring (R² is not a valid cluster metric)
`conjoint`	Utility interpretation, revenue optimization, cross-attribute comparisons
`text_analytics`	Sentiment, topic extraction, LLM prompt patterns for qualitative data

Each playbook notes its anti-patterns explicitly. The clustering module, for instance, blocks R² as a cluster quality metric because reaching for it is a common mistake.
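An anti-pattern block of this kind can be as simple as a lookup that runs before any metric is reported. The registry and function below are hypothetical names for illustration; the text does not show how the playbooks store their anti-patterns.

```python
# Hypothetical registry: module name -> metrics the playbook refuses to report,
# each paired with the advice given instead.
BLOCKED_METRICS = {
    "kmeans": {"r2": "use the elbow method or silhouette scoring instead"},
}


def check_metric(module, metric):
    """Return the metric name if allowed, else raise with the playbook's advice."""
    advice = BLOCKED_METRICS.get(module, {}).get(metric.lower())
    if advice is not None:
        raise ValueError(f"{metric} is blocked for {module}: {advice}")
    return metric


check_metric("kmeans", "silhouette")  # allowed
check_metric("regression", "r2")      # allowed: R² is valid for regression
```

Calling `check_metric("kmeans", "r2")` raises a `ValueError` carrying the suggested alternative.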

The regression diagnostics helper

The regression playbook runs agents.regression_diagnostics before any refinement step. This is a standalone Python helper:

python3.12 -m agents.regression_diagnostics --data <csv> --target <col>

Smoke-tested on the Q5 Employee Job Satisfaction dataset (n=3,025, 14 predictors): it cleanly dropped 8 noise predictors and flagged Age and NumCompanies at VIF 160+. The printed output shows R², Adj R², gap, F p-value, max VIF, and n_significant/n_total.

That is the same output you would get in a graded statistics course. The agent learned what “correct” looks like from the course materials, not from general ML conventions.
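Two of the printed quantities follow from standard formulas, sketched below. The sample size and predictor count come from the Q5 dataset described above; the R² of 0.60 is a made-up placeholder, not the helper's actual output.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R² for n observations and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


def vif_from_aux_r2(aux_r2):
    """VIF of a predictor, given the R² from regressing it on the others."""
    return 1 / (1 - aux_r2)


n, p = 3025, 14            # from the Q5 dataset
r2 = 0.60                  # hypothetical value for illustration only
gap = r2 - adjusted_r2(r2, n, p)
print(f"Adj R² = {adjusted_r2(r2, n, p):.4f}, gap = {gap:.4f}")

# A VIF of 160 corresponds to an auxiliary R² of 0.99375, meaning the
# other predictors explain over 99% of that predictor's variance.
print(f"VIF = {vif_from_aux_r2(0.99375):.1f}")
```

With 3,025 rows the R²/Adj R² gap stays small even with 14 predictors, which is why the VIFs and per-predictor significance carry most of the diagnostic signal on a dataset this size.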

What to take from this

The analytics suite demonstrates one pattern worth replicating in any domain where accuracy requires computation rather than recall:

The model’s job is to plan and interpret. The tools’ job is to compute. Keep those roles separate and you can verify each one independently. The problem parser classifies the question type. The playbook selects the method and sequence. The solver runs the computation. The verifier checks the result. The marimo notebook exports the work so a human can audit every step.
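The separation of roles can be sketched as four small functions that are each testable on their own. Everything here is a toy under assumptions: the keyword map, the playbook steps, and the stand-in computation (a column mean) are hypothetical and far simpler than the real parser, playbooks, and solver.

```python
import math
from statistics import fmean

# Hypothetical keyword map; the real parser's rules are not shown in the text.
TOPIC_KEYWORDS = {
    "kmeans": ("cluster", "k-means"),
    "regression": ("ols", "coefficient"),
}

# Hypothetical playbooks: topic -> ordered steps.
PLAYBOOKS = {
    "kmeans": ["elbow", "silhouette"],
    "regression": ["diagnose", "refine", "validate"],
}


def parse(question):
    """Parser: classify the question by the first topic whose keyword appears."""
    text = question.lower()
    for topic, words in TOPIC_KEYWORDS.items():
        if any(word in text for word in words):
            return topic
    raise ValueError("unrecognized question type")


def solve(values):
    """Solver: stand-in computation, here just the mean of a column."""
    return fmean(values)


def verify(answer, values):
    """Verifier: recompute by a different route and compare."""
    return math.isclose(answer, sum(values) / len(values))


def run(question, values):
    topic = parse(question)
    answer = solve(values)
    return {
        "topic": topic,
        "steps": PLAYBOOKS[topic],  # playbook: method and sequence
        "answer": answer,
        "verified": verify(answer, values),
    }


print(run("Interpret the OLS coefficient on tenure", [1.0, 2.0, 3.0]))
```

Because each stage takes plain inputs and returns plain outputs, a failure can be traced to the parser, the playbook, the solver, or the verifier without rerunning the whole pipeline.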

None of that requires a smarter model. It requires the right separation of concerns and the discipline to enforce the diagnostic sequence before escalating.

Next: From data analysis to physical health. The Health Stack applies the same pattern — KB-backed specialist agent, goal-file injection, evidence-tiered output — to fitness and peptide protocols.