Releases: fredchu/claude-automl
v5.7.0 — Environment Gap Research Gate
What's New
Environment Gap Research Gate (Phase 1 Step B')
When SCD C5 (Test Environment Gap) confidence is low, automl now forces a research step before designing evaluators. This prevents the "guess → deploy → fail → repeat" anti-pattern that wastes hours on platform-specific bugs.
Trigger: C5 confidence < 7 (agent admits uncertainty about test-vs-production gap)
Tool priority: NotebookLM deep research (300+ sources + multi-round Q&A) → WebSearch fallback (3-5 rounds)
Origin story: MumbleKey iOS background recording debugging — 6 rounds of guessing failed, one NLM research session found all 6 root causes at once. Lesson learned.
Phase 2 Emergency Research Gate
When a subagent is stuck and the cause can't be diagnosed from code alone, automl pauses the loop and triggers research before continuing.
NLM Pre-Create Routing
Research Gate respects the /notebooklm skill's notebook registry — checks for existing relevant notebooks before creating new ones. No duplicate notebooks.
Recommended: notebooklm-py
`pip install notebooklm-py && notebooklm skill install` — strongly recommended for best Research Gate results. Falls back to WebSearch if not installed.
See CHANGELOG.md for full details and migration guide.
v5.6.0 — System Context Dialogue + Coverage Sanity Check
What's New
System Context Dialogue (Phase 1 Step B, mandatory)
FMEA-driven conversation with user before evaluator design. Five categories: Trigger Paths, Dependencies & Latency, Environmental Constraints, History & Workarounds, Test Environment Gap. Agent self-assesses confidence per category, asks 3-5 targeted questions. Rapid RPN scoring (S×O×D, max 125) prioritizes failure modes.
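The rapid RPN scoring can be illustrated with a short sketch (the 1-5 scale and the max of 125 come from the notes above; the helper and the sample failure modes are invented for illustration):

```python
# Hypothetical sketch of rapid RPN (Risk Priority Number) scoring.
# S, O, D = Severity, Occurrence, Detectability, each on a 1-5 scale,
# so the maximum RPN is 5 * 5 * 5 = 125.
def rpn(severity: int, occurrence: int, detectability: int) -> int:
    for v in (severity, occurrence, detectability):
        assert 1 <= v <= 5, "each factor is scored 1-5"
    return severity * occurrence * detectability


# Illustrative failure modes as (description, S, O, D) tuples.
failure_modes = [
    ("audio session not released",  5, 3, 4),
    ("trigger path never fires",    4, 2, 2),
    ("stale workaround re-applied", 3, 4, 3),
]

# Highest RPN first: these failure modes get evaluators designed first.
ranked = sorted(failure_modes, key=lambda fm: rpn(*fm[1:]), reverse=True)
```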
Red Team Blind Spots
After gaming attempts, red team now identifies production scenarios that would fail but evaluators can't catch. Results flow to injectable tests, risk scenarios, or manual verification checklist.
Phase 1.5b' — Coverage Sanity Check
Fixes structural blind spot where SCD (failure-focused) + red team (gaming-focused) systematically miss happy paths. Three mechanical checks:
- CHECK 1: Happy Path Coverage — at least one test verifying normal successful completion
- CHECK 2: Alternative Path Parity — each trigger path has at least one test
- CHECK 3: State Sequence Robustness — interrupt-then-retry test for state machines
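A minimal sketch of how such mechanical checks might look (the test-metadata shape and tag vocabulary are invented for illustration, not the skill's real schema):

```python
# Hypothetical sketch of the three coverage sanity checks.
# Each test is a dict tagged with the trigger path it exercises and a "kind".
def coverage_sanity_check(tests: list[dict], trigger_paths: set[str],
                          has_state_machine: bool) -> list[str]:
    failures = []
    # CHECK 1: at least one test verifies normal successful completion
    if not any(t.get("kind") == "happy_path" for t in tests):
        failures.append("CHECK 1: no happy-path test")
    # CHECK 2: every trigger path has at least one test
    covered = {t.get("path") for t in tests}
    for path in sorted(trigger_paths - covered):
        failures.append(f"CHECK 2: trigger path '{path}' untested")
    # CHECK 3: state machines need an interrupt-then-retry test
    if has_state_machine and not any(t.get("kind") == "interrupt_retry" for t in tests):
        failures.append("CHECK 3: no interrupt-then-retry test")
    return failures
```

Because the checks are mechanical rather than judgment calls, they catch the happy-path blind spot even when both SCD and red team are focused elsewhere.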
Also includes v5.5 features
- `required_tests` — red team findings become structured test requirements
- `methodology_skill` — TDD as methodology separated from domain skill
- TDD staging (RED→GREEN→REFACTOR) enforced with `required_tests`
See CHANGELOG.md for full details and migration guide.
v4.0.0 — Mandatory Skills + Phase 3 Subagent Architecture
What's New
Mandatory Skills — Each task now requires a specialized skill (e.g., /investigate for debugging, /review for code review). Subagents load skills via the Skill tool before making changes. Default is "skill required" — none is the exception that needs justification.
Phase 3 Subagent Architecture — Phase 3 verification is no longer done by the main session. Three specialized subagents handle it:
- FINAL_VERIFICATION (haiku) — re-runs all evaluators + risk scenario test cases
- RISK_REVIEW (opus) — traces each risk scenario through actual code paths
- CODE_REVIEW (codex-worker / sonnet fallback) — diff-aware review with security analysis
Model Routing — Every Agent call specifies a model: haiku for mechanical tasks, sonnet for execution, opus for deep analysis. User-overridable via params.model_overrides.
Skill Mapping Table — New `references/skill-mapping.md` provides task type → recommended skill lookup for Phase 1, 2, and 3.
gstack Integration — Phase 0 adds /design-consultation, Phase 1 defaults to /autoplan, Phase 2/3 integrate /investigate, /review, /cso, /qa-only, /benchmark.
Breaking Changes
- Fourth required element: mandatory skill per task
- State file format changed (new fields: `skill`, `phase3_skill`, `risk_scenarios`, `phase3` block)
- Phase 3 has full dispatcher decision tree in main skill file
Upgrade
`cd ~/.claude/skills/automl && git pull`

Existing v3 runs continue in v3 mode. New runs use v4 format.
See CHANGELOG.md for full details.
v3.1.0 — Risk Scenarios & Verification Checklist
What's New
Phase 1: Risk Scenarios — each task now requires 3-5 "how could this break?" scenarios that auto-flow into evaluators and Phase 3 checklists. Works for code, text, config — any domain.
Phase 3: Risk Scenario Review — mandatory: trace each scenario through actual implementation, rating ✅ Safe or 🔴 Bug.
Phase 3: Verification Checklist — mandatory output: prioritized test list (crash > data loss > UX > cosmetic). User runs full list in one pass, minimizing ping-pong.
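The priority ordering (crash > data loss > UX > cosmetic) can be sketched as follows (the severity labels and helper are illustrative, not the checklist's real format):

```python
# Hypothetical sketch of verification checklist prioritization:
# crash > data loss > UX > cosmetic. Labels are invented for illustration.
SEVERITY_ORDER = {"crash": 0, "data_loss": 1, "ux": 2, "cosmetic": 3}


def prioritize(checklist: list[tuple[str, str]]) -> list[str]:
    """Sort (item, severity) pairs so the most dangerous checks come first."""
    return [item for item, sev in sorted(checklist, key=lambda c: SEVERITY_ORDER[c[1]])]
```

Sorting once up front is what lets the user run the whole list in a single pass instead of ping-ponging on low-stakes items first.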
Why
In a real iOS architecture rewrite, automl v3.0.0 caught build errors but missed a race condition, an audio session leak, and two state machine bugs. All four missed bugs would have been caught by upfront risk scenario analysis.
Upgrade
`cd ~/.claude/skills/automl && git pull`

Full changelog: CHANGELOG.md
v3.0.0 — Initial Release
Autonomous Evaluation Loop for Claude Code
Define success. Let the agent iterate until it gets there.
Features
- Dual-loop engine — per-task improvement + cross-task regression check
- Two evaluator modes — shell (exit code / score) and checklist (LLM-as-judge)
- Subagent architecture — main session dispatches only, never touches code directly
- Auto-resume — state persists in `.automl/{run_id}/`; interrupted sessions continue automatically
- Safety — git tag baseline, whitelist scope, STOP file interrupt, non-git fallback
- Optional skill integrations — Phase 0/1/3 can chain with external brainstorming, planning, and review skills
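The shell evaluator mode from the feature list can be sketched in a few lines (treating exit code 0 as pass is an assumption here, not the documented contract):

```python
# Hypothetical sketch of the shell evaluator mode: run the evaluator
# command and treat exit code 0 as "task passes". Score parsing and
# richer result handling are omitted for brevity.
import subprocess


def run_shell_evaluator(command: str) -> bool:
    """Run the evaluator command; a zero exit code means the task passes."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.returncode == 0
```

For example, the Quick Start's `pytest tests/ -q` evaluator would pass exactly when pytest exits 0.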
Install
`git clone https://github.com/fredchu/claude-automl ~/.claude/skills/automl`

Quick Start
/automl make all tests pass
evaluator: pytest tests/ -q
scope: src/
See README for full documentation (English + 繁體中文).