Plugin system (groups of skills, MCP servers) #97
This is great - we should break out skills though so they're not baked into the Claude agent - which is how we test baseline / unaugmented Claude performance. We should have some mechanism to test each agent with / without skills.
@joellabes @jasnonaz - @clkao had an earlier PR for a concept of plugins that I believe incorporated skills. I don't have strong opinions about these, though I do bias a bit towards more general concepts so that they could be extended to other types of products (e.g., if Gemini adds something like skills, etc.). It might be worth looking at that implementation and seeing if there are generalizations for this: #42
Oh yeah for sure! Would happily adopt @clkao's setup instead - I'll just keep noodling with the skill text in the meantime
Cool @clkao, do you have any thoughts on this?
Replaces --use-mcp and --use-skills flags with YAML-configured plugin sets for A/B comparison of agent performance. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
12 tasks with TDD approach, exact file paths, and code snippets. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Defines four skill sets:
- no-plugins: baseline without skills or MCP (default)
- dbt-skills: Claude-only, installs dbt agent skills
- dbt-mcp: configures dbt MCP server (default)
- dbt-full: both skills and MCP for Claude

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
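The four sets described above might look roughly like this in the YAML config (a sketch only: every key name here is an illustrative assumption, not the actual schema of the file):

```yaml
# Hypothetical shape of skill-sets.yaml (later renamed plugin-sets.yaml);
# all field names are illustrative assumptions.
skill_sets:
  no-plugins:
    default: true            # baseline: no skills, no MCP
  dbt-skills:
    agents: [claude]         # Claude-only
    skills: [using-dbt-for-analytics-engineering]
  dbt-mcp:
    default: true
    mcp_servers:
      - name: dbt
        command: uvx
        args: [dbt-mcp]
  dbt-full:                  # both skills and MCP for Claude
    agents: [claude]
    skills: [using-dbt-for-analytics-engineering]
    mcp_servers:
      - name: dbt
        command: uvx
        args: [dbt-mcp]
```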
Replace used_mcp field with skill_set_name, skill_set_skills, and skill_set_mcp_servers for richer experiment tracking. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The new --plugin-set flag accepts skill set names from skill-sets.yaml. When not specified, uses all default skill sets for A/B comparison. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Replace use_mcp/use_skills params with plugin_set_names
- Add _init_skill_sets() to load skill sets from YAML config
- Refactor run() to iterate over skill sets with suffixed run IDs
- Move existing run logic to _execute_trials()
- Add skill set metadata (name, skills, mcp_servers) to TrialResults
- Derive use_skills from current skill set configuration

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Replace use_skills parameter with skill_set
- Add SkillsHandler and McpHandler imports
- Call handlers to install skills and configure MCP servers
- Remove obsolete _install_skills_via_cli from agent_setup.py
- Update harness to pass skill_set to orchestrator

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Logic has moved to McpHandler in ade_bench/plugins/mcp_handler.py Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Rename all "skill set" terminology to "plugin set" for clarity:
- SkillSet -> PluginSet
- SkillSetsConfig -> PluginSetsConfig
- SkillSetLoader -> PluginSetLoader
- skill-sets.yaml -> plugin-sets.yaml
- Related variables, methods, and test names updated

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The used_mcp field was removed when implementing the plugin set system. Replace with plugin_set_name to fix the AttributeError in CI. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The dbt-mcp server requires DBT_PROJECT_DIR and DBT_PATH environment variables to locate the dbt project and executable. The plugin-based MCP configuration was missing these dynamic values that the old setup-dbt-mcp.sh script set.

Changes:
- Add _get_dbt_dynamic_env() to query container for dbt path
- Set DBT_PROJECT_DIR to container app directory (/app)
- Set DBT_PATH from `which dbt` in container
- Set DISABLE_DBT_CLI=false to enable dbt CLI in dbt-mcp
- Detect dbt MCP servers by name or dbt-mcp in args
- Merge dynamic vars with static config from plugin-sets.yaml

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
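The dynamic discovery this commit describes could be sketched like so (hedged: the real `_get_dbt_dynamic_env()` signature is unknown, and the `docker exec` call is an assumption about how the container is queried):

```python
import subprocess


def _get_dbt_dynamic_env(container_name: str) -> dict[str, str]:
    """Sketch of the dynamic env discovery described in the commit message.
    Assumes `docker exec`; the real code may query the container differently."""
    # Ask the container where the dbt executable lives (`which dbt`).
    result = subprocess.run(
        ["docker", "exec", container_name, "which", "dbt"],
        capture_output=True, text=True, check=True,
    )
    return {
        "DBT_PROJECT_DIR": "/app",          # container app directory
        "DBT_PATH": result.stdout.strip(),  # e.g. /usr/local/bin/dbt
        "DISABLE_DBT_CLI": "false",         # enable dbt CLI inside dbt-mcp
    }


def merge_mcp_env(static_env: dict[str, str], dynamic_env: dict[str, str]) -> dict[str, str]:
    """Merge dynamic vars with static config from plugin-sets.yaml;
    dynamic values win on conflict."""
    return {**static_env, **dynamic_env}
```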
Hi @joellabes, I am finally back at this. I will take a look at this PR. Is this skill-specific, or does it also provide some hooks for plugin authors to enable MCP and other stuff?
@joellabes great! I noticed you use superpowers as well! ✨ Did you mean to merge the plans too? Another Q: are there ways to extend plugin-sets.yaml without modifying it? I'd imagine one fiddling with some local config, since this can also customize prompts.
I tested the branch with --plugin-set all-dbt-skills with the @demo set and it worked. I can see the one task (airbnb005) loading the relevant skill, but for some reason no skills are triggered for analytics_engineering002.medium, f1003.hard and intercom002.
I think we should have the container image names include the plugin-set, so runs with different sets won't conflict.
I have kept them around during development and review (they turned out to be useful: MCP wasn't working, and a new Claude instance noticed that past-Claude had not implemented all steps of the plan). But I do not intend to keep them around post-merge.
You need to modify the YAML file to add new plugin-sets, but you don't have to commit them. I think I'm missing some context behind your question - why is extending plugin sets without modifying the file a valuable goal?
stochasticism, amirite? When I ran with the
Strong agree! Discussed further in #96 (comment) but I will approach that in a different PR because this one is already a monster.
My questions while scrolling through the diff, and Claude's responses

1. Could transcript generation be more generalised for other LLMs?

Yes, and this is worth addressing. The current approach has two layers of coupling:
Recommendation: Define a contract where any agent's transcript is a single HTML file at a well-known path (e.g., `sessions/transcript.html`).

Decision: ✅ Change to standard well-known html file location

2. Could log extraction happen earlier so JSONL is the only thing that comes back from Docker?

You're right that it can't easily, and the current approach is reasonable. Here's why:
The fundamental problem is that tmux pane capture is the mechanism for getting data out of the container — you can't easily have the container write a clean JSONL file that's accessible outside. You'd need either a volume mount for the log

One improvement: The agent could also write

Decision: ⌛ Ignore trying to generate separate jsonl file for now

3. Are the
| Component | Current PluginSet support | Gap |
|---|---|---|
| Skills (`skills/`) | Yes — `SkillOrigin` | None |
| MCP servers (`.mcp.json`) | Yes — `McpServerConfig` | None |
| Commands (`commands/`) | No | Would need adding |
| Agents (`agents/`) | No | Would need adding |
| Hooks (`hooks/hooks.json`) | No | Would need adding |
| LSP servers (`.lsp.json`) | No | Would need adding |
The PluginSet model is currently a flat configuration object, not a representation of a Claude Code plugin. It works well for the current "skills + MCP" scope. To support the full plugin spec, you'd likely want to shift from the current "list of individual skills + list of MCP servers" model to something closer to "list of plugin directories to install," since that's how Claude Code actually consumes them.
Recommendation: For now this is fine — you're testing skills and MCP. If/when you need hooks or LSP, consider whether the PluginSet should evolve into a PluginConfig that points to actual plugin directories (which naturally support all component types) rather than enumerating each component type separately.
Decision: ✅ Leave as-is for now, consider different config in future
6. Are there assumptions about artifact shape that are no longer true?
Yes, a few:
- `generate_results_html.py:320-328`: Hardcodes `sessions/transcript/page-001.html`. If `claude-code-transcripts` changes its output format (e.g., produces `index.html` instead), the HTML pipeline breaks silently (no transcript shown, falls back to raw text). The `ClaudeCodeLogFormatter` already handles both `index.html` and `page-001.html` (`log_formatter.py:329-336`), but the HTML generator only checks for `page-001.html`.
- `summarize_results.py` and TSV output: The `plugin_set` column in TSV replaces the old `used_mcp` column. The `analyze.py` migration script references `plugin_set` in its column order. If you have old experiment results with `used_mcp`, the migration path in `analyze.py` should handle this — worth verifying.
- Tool tags in HTML: The HTML summary now renders `tools_used` as `<span class="tool-tag">` elements. This assumes `tools_used` is a flat list of strings. The `skill:` prefix convention (e.g., `skill:using-dbt-for-analytics-engineering`) is a display concern baked into the data model — if you later want to distinguish tool types in the UI, you'd need to parse this prefix back out.
- `_GENERIC_TOOLS` filter (`claude_code_agent.py:82-86`): This is Claude-specific. If Gemini or Codex agents get tool extraction, they'd need their own filter lists. Currently those agents don't implement `extract_tools_used()` or `format_agent_log()`, so it's fine for now.
Decision: ✅ None of these are blockers
7. Is there a better way to handle prompt_suffix?
The current approach loses information
Here's the flow:
1. `PluginSet.prompt_suffix` is defined in YAML
2. In `_run_agent()` (`harness.py:501-503`), it's extracted from the current plugin set
3. In `_run_agent_with_timeout()` (`harness.py:474-475`), it's concatenated: `full_prompt = f"{task_prompt}\n\n{prompt_suffix}"`
4. The agent receives only `full_prompt` — it has no knowledge of what was original vs. appended
5. `TrialResults.task_prompt` (`harness.py:653`) stores the original task prompt (before suffix)
What's missing: The suffix and the combined prompt are not persisted anywhere. If you look at results later, you can infer the suffix from plugin_set_name, but you can't see the exact text that was sent.
Recommendation: Add prompt_suffix and/or full_prompt fields to TrialResults. This gives you full traceability: original prompt, suffix, and what was actually sent. The cost is a small amount of data duplication but the debugging value is high.
Decision: ✅ Store prompt_suffix in results if present, alongside original prompt
8. Are there READMEs or documentation that are now out of date?
Yes
- `CLAUDE.md` (this project's instructions): Still references `--use-mcp` implicitly through the "Key Parameters" section. The `--plugin-set` flag should be documented there.
- `README.md` (main project): Contains usage examples that still show `--use-mcp` (lines 115, 162, 400-407) and don't mention `--plugin-set`.
- `shared/scripts/setup-dbt-mcp.sh`: This file was deleted in the PR (the old MCP setup script). If any documentation references it, those references are stale.
Decision: ✅ Update docs
9. Are the pytests appropriate?
Good coverage of the new models and plugin loading, but significant gaps in the new behavior:
Well-tested:
- `test_plugin_set.py`: Pydantic model construction, `install_all()`, agent compatibility, YAML deserialization, lookup methods. Solid.
- `test_loader.py`: Happy path loading, missing files, default resolution, agent compatibility filtering. Good.
- `test_skills_handler.py`: No-op for empty skills, install-all vs. specific skills, multiple origins, failure resilience. Good.
Not tested at all:
- `_configure_mcp_servers()` — the entire MCP setup path (env file writing, `claude mcp add` command construction, dynamic dbt env discovery). This is complex code with shell commands and would benefit from at least mocked tests.
- `extract_tools_used()` — tool extraction with `_GENERIC_TOOLS` filtering and `Skill` -> `skill:name` expansion. Easy to unit test with fixture JSONL data.
- `ClaudeCodeLogFormatter.extract_jsonl_content()` — the JSONL extraction logic including synthetic prompt injection. This is non-trivial parsing logic.
- `generate_html_transcript()` — can be tested with a mock of `claude_code_transcripts`.
- Prompt suffix injection — no test verifying that `_run_agent_with_timeout` correctly concatenates the suffix.
- TSV writer changes — the `tools` and `plugin_set` columns.
- Harness multi-plugin-set iteration — the `__`-suffixed run IDs and separate output directories.
Recommendation: Prioritize tests for extract_jsonl_content() (parsing logic), extract_tools_used() (easy wins), and _configure_mcp_servers() (complex shell interaction). These are the areas most likely to regress.
Decision: 🤏 Add tests for _configure_mcp_servers only
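The mocked-test pattern for `_configure_mcp_servers` could look like this (a sketch: `configure_mcp_server` is a hypothetical stand-in with an invented signature, not the real method):

```python
from unittest import mock
import subprocess


def configure_mcp_server(agent_cli: str, name: str, command: str, args: list[str]) -> list[str]:
    """Hypothetical stand-in for the command construction in
    _configure_mcp_servers(); returns the command line it ran."""
    cmd = [agent_cli, "mcp", "add", name, "--", command, *args]
    subprocess.run(cmd, check=True)
    return cmd


# Patch subprocess.run so no shell command actually executes, then
# assert on the constructed command line.
with mock.patch("subprocess.run") as run:
    cmd = configure_mcp_server("claude", "dbt", "uvx", ["dbt-mcp"])
    run.assert_called_once_with(cmd, check=True)
    assert cmd[:3] == ["claude", "mcp", "add"]
```

The same pattern covers the env-file and dynamic-env branches: patch the shell boundary, assert on the arguments.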
Additional Observations
`_configure_mcp_servers` is coupled to Claude CLI

`abstract_installed_agent.py:96` uses `self.NAME.value` (e.g., "claude") as the CLI binary name for `mcp add`. This assumes all installed agents use the same `<agent_name> mcp add` command syntax. Gemini CLI doesn't support this — it has no MCP integration via CLI. The `agents` filter on `PluginSet` handles this at the config level, but the code in `AbstractInstalledAgent` would still try to run `gemini mcp add` if someone misconfigures a plugin set. Consider either:

- Moving `_configure_mcp_servers()` to `ClaudeCodeAgent` specifically, or
- Adding a guard that checks whether the agent supports MCP before calling it.
Decision: ❌ Do nothing
`hasattr(task_agent, '_log_formatter')` is a code smell

`harness.py:763` uses `hasattr` to duck-type check for log formatter support. This is fragile — it checks for a private attribute. A cleaner approach would be an optional method on `BaseAgent` (e.g., `supports_log_formatting() -> bool`) or making `format_agent_log()` and `extract_tools_used()` part of the `BaseAgent` interface with default no-op implementations. (Note: these methods already exist on `BaseAgent` with no-op defaults — the `hasattr` check is unnecessary.)
Decision: ✅ Fix to use existing no-op
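The fix amounts to relying on those no-op defaults, something like the following (a sketch: the method bodies and the `ClaudeCodeAgent` override are illustrative assumptions):

```python
class BaseAgent:
    """Sketch of the no-op defaults mentioned above; the method names come
    from the review, the bodies are assumptions."""

    def format_agent_log(self, raw_log: str) -> str:
        return raw_log  # default: pass the log through unchanged

    def extract_tools_used(self, raw_log: str) -> list[str]:
        return []  # default: no tool extraction


class ClaudeCodeAgent(BaseAgent):
    def extract_tools_used(self, raw_log: str) -> list[str]:
        # Illustrative only: a real implementation would parse the JSONL log.
        return ["Bash", "skill:using-dbt-for-analytics-engineering"]


# Callers can now invoke the methods unconditionally, with no
# hasattr(task_agent, '_log_formatter') duck-typing:
for agent in (BaseAgent(), ClaudeCodeAgent()):
    agent.extract_tools_used("")
```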
The `--env-file` flag in MCP add command

`abstract_installed_agent.py:133`: The constructed command puts `--env-file` between the MCP command and its args. This works for `uvx` but depends on the runner not consuming the flag itself. If `uvx` interprets `--env-file` itself rather than passing it to the subprocess, this could break. Worth verifying if other MCP commands are added.
Decision: ❌ Leave as-is and assume uv will continue to work as it does
- Move plugin set models from ade_bench/models/ to harness_models.py (consistent with existing convention)
- Change default plugin set to 'none' only (avoid double-run footgun)
- Persist prompt_suffix in TrialResults and TSV output for traceability
- Replace hasattr(_log_formatter) duck-typing with BaseAgent interface methods (format_agent_log, extract_tools_used already have no-op defaults)
- Standardize transcript output to single sessions/transcript.html file instead of agent-specific directory structure
- Update CLAUDE.md and README.md: replace --use-mcp with --plugin-set
- Add unit tests for _configure_mcp_servers (8 test cases)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@bstancil Do you want to give this a gander? Finally ready to rumble
created #118 |
Resolve conflicts from black formatting and ruff lint changes on main, keeping plugin set system from this branch (replacing use_mcp boolean). Also incorporate --log-level CLI option added on main. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Very in favor of this, as MCP was the hotness when this got built, and now skills are definitely the new hotness instead. (And honestly, I'm not sure how much longer MCP will really matter: https://trends.google.com/explore?q=claude%2520skills%2Cclaude%2520mcp&date=today%205-y&geo=Worldwide) Also, on these things, two more general points:
I keep seeing the LLMs make dumb assumptions about dbt's behaviour, so wanted to try using Skills to give it extra context.
In theory, Claude automatically recognises when a skill would be appropriate. In practice, it doesn't unless you yell at it 🤦. So we shouldn't merge this as-is.