
Plugin system (groups of skills, MCP servers)#97

Open
joellabes wants to merge 51 commits into main from joellabes/add-claude-skill

Conversation

@joellabes
Collaborator

I keep seeing the LLMs make dumb assumptions about dbt's behaviour, so wanted to try using Skills to give it extra context.

In theory, Claude automatically recognises when a skill would be appropriate. In practice, it doesn't unless you yell at it 🤦. So we shouldn't merge this as-is.

@jasnonaz
Collaborator

This is great - but we should break the skills out so they're not baked into the Claude agent, which is how we test baseline / unaugmented Claude performance. We should have some mechanism to test each agent with and without skills.

@bstancil
Collaborator

bstancil commented Dec 15, 2025

@joellabes @jasnonaz - @clkao had an earlier PR for a concept of plugins that I believe incorporated skills. I don't have strong opinions about these, though I do bias a bit towards more general concepts so that they could be extended to other types of products (eg, if gemini adds something like skills, etc). It might be worth looking at that implementation and seeing if there are generalizations for this: #42

@joellabes
Collaborator Author

Oh yeah for sure! Would happily adopt @clkao's setup instead - I'll just keep noodling with the skill text in the meantime

@bstancil
Collaborator

Cool, @clkao , you have any thoughts on this?

joellabes and others added 18 commits January 28, 2026 10:53
Replaces --use-mcp and --use-skills flags with YAML-configured
plugin sets for A/B comparison of agent performance.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
12 tasks with TDD approach, exact file paths, and code snippets.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Defines four skill sets:
- no-plugins: baseline without skills or MCP (default)
- dbt-skills: Claude-only, installs dbt agent skills
- dbt-mcp: configures dbt MCP server (default)
- dbt-full: both skills and MCP for Claude

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace used_mcp field with skill_set_name, skill_set_skills,
and skill_set_mcp_servers for richer experiment tracking.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The new --plugin-set flag accepts skill set names from skill-sets.yaml.
When not specified, uses all default skill sets for A/B comparison.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Replace use_mcp/use_skills params with plugin_set_names
- Add _init_skill_sets() to load skill sets from YAML config
- Refactor run() to iterate over skill sets with suffixed run IDs
- Move existing run logic to _execute_trials()
- Add skill set metadata (name, skills, mcp_servers) to TrialResults
- Derive use_skills from current skill set configuration

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Replace use_skills parameter with skill_set
- Add SkillsHandler and McpHandler imports
- Call handlers to install skills and configure MCP servers
- Remove obsolete _install_skills_via_cli from agent_setup.py
- Update harness to pass skill_set to orchestrator

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Logic has moved to McpHandler in ade_bench/plugins/mcp_handler.py

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Rename all "skill set" terminology to "plugin set" for clarity:
- SkillSet -> PluginSet
- SkillSetsConfig -> PluginSetsConfig
- SkillSetLoader -> PluginSetLoader
- skill-sets.yaml -> plugin-sets.yaml
- Related variables, methods, and test names updated

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The used_mcp field was removed when implementing the plugin set system.
Replace with plugin_set_name to fix the AttributeError in CI.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
joellabes and others added 4 commits February 4, 2026 07:56
The dbt-mcp server requires DBT_PROJECT_DIR and DBT_PATH environment
variables to locate the dbt project and executable. The plugin-based
MCP configuration was missing these dynamic values that the old
setup-dbt-mcp.sh script set.

Changes:
- Add _get_dbt_dynamic_env() to query container for dbt path
- Set DBT_PROJECT_DIR to container app directory (/app)
- Set DBT_PATH from `which dbt` in container
- Set DISABLE_DBT_CLI=false to enable dbt CLI in dbt-mcp
- Detect dbt MCP servers by name or dbt-mcp in args
- Merge dynamic vars with static config from plugin-sets.yaml

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@joellabes joellabes marked this pull request as ready for review February 5, 2026 03:20
@joellabes joellabes marked this pull request as draft February 6, 2026 01:49
@clkao
Contributor

clkao commented Feb 6, 2026

Hi @joellabes, I am finally back at this. I will take a look at this PR. Is this skill-specific, or does it also provide some hooks for plugin authors to enable MCP and other stuff?

@joellabes
Collaborator Author

joellabes commented Feb 6, 2026

@clkao it started as just skills but is now more all-purpose (it can do MCP and skills today, and I believe it's extensible to LSPs, subagents, and anything else)

I think it's a spiritual blend of your PR plus @b-per's evals config for the dbt-agent-skills repo

@clkao
Contributor

clkao commented Feb 6, 2026

@joellabes great! I noticed you use superpowers as well! ✨ Did you mean to merge the plans too?

Another Q: are there ways to extend plugin-sets.yaml without modifying it? I'd imagine one fiddling with some local config, since this can also customize prompts

@clkao
Contributor

clkao commented Feb 6, 2026

I tested the branch with --plugin-set all-dbt-skills with the @demo set and it worked. I can see one task (airbnb005) loading the relevant skill, but for some reason no skills are triggered for analytics_engineering002.medium, f1003.hard and intercom002

@clkao
Contributor

clkao commented Feb 6, 2026

I think we should include the plugin-set in the container image names, so runs with different sets won't conflict.

@joellabes
Collaborator Author

Did you mean to merge the plans too?

I have kept them around during development and review (they turned out to be useful, MCP wasn't working and a new Claude instance noticed that past-Claude had not implemented all steps of the plan)

But I do not intend to keep them around post-merge.

Another Q: are there ways to extend plugin-sets.yaml without modifying it? I'd imagine one fiddling with some local config, since this can also customize prompts

You need to modify the YAML file to add new plugin-sets, but you don't have to commit them. I think I'm missing some context behind your question - why is extending plugin sets without modifying the file a valuable goal?

I tested the branch with --plugin-set all-dbt-skills with the @demo set and it worked. I can see one task (airbnb005) loading the relevant skill, but for some reason no skills are triggered for analytics_engineering002.medium, f1003.hard and intercom002

stochasticism, amirite? When I ran with the dbt-for-ae plugin set, all three of those tasks successfully loaded the skill. So I think some of this is bad luck and the kind of thing we're crossing our fingers for Anthropic to improve

I think we should include the plugin-set in the container image names, so runs with different sets won't conflict.

Strong agree! Discussed further in #96 (comment) but I will approach that in a different PR because this one is already a monster.

@joellabes joellabes requested a review from Copilot February 9, 2026 02:08


@joellabes
Collaborator Author

My questions while scrolling through the diff, and Claude's responses

1. Could transcript generation be more generalised for other LLMs?

Yes, and this is worth addressing. The current approach has two layers of coupling:
  • claude-code-transcripts library: Produces multiple HTML files (page-001.html, etc.) in a directory. The generate_results_html.py script hardcodes a check for sessions/transcript/page-001.html and copies the entire transcript/ directory.

  • The log formatter hierarchy is already partially generalized: LogFormatter is abstract with a default no-op generate_html_transcript(). But the HTML results pipeline (generate_results_html.py:318-328) assumes the Claude-specific transcript directory structure.

Recommendation: Define a contract where any agent's transcript is a single HTML file at a well-known path (e.g., sessions/transcript.html). The generate_html_transcript() method would be responsible for producing that single file — the Claude implementation would call claude-code-transcripts internally and then consolidate/inline the output. This decouples the results pipeline from any library's output structure. Other agents (Gemini, Codex) could produce their own single-file transcripts using whatever tools they have.
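The consolidation step in that contract can be sketched as follows; the function name and the page-001.html naming are assumptions based on the description above, not the real pipeline code:

```python
from pathlib import Path


def consolidate_transcript(transcript_dir: Path, output_file: Path) -> bool:
    """Merge paginated transcript pages into a single well-known HTML file.

    Hypothetical sketch: assumes the library emits page-001.html, page-002.html, ...
    Returns False when there is nothing to consolidate, so callers can fall back.
    """
    pages = sorted(transcript_dir.glob("page-*.html"))
    if not pages:
        return False
    # Concatenate pages in order; a real implementation might also inline assets.
    merged = "\n".join(p.read_text(encoding="utf-8") for p in pages)
    output_file.write_text(merged, encoding="utf-8")
    return True
```

With this shape, generate_results_html.py only ever checks for the single sessions/transcript.html path, regardless of which agent produced it.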

Decision: ✅ Change to standard well-known html file location

2. Could log extraction happen earlier so JSONL is the only thing that comes back from Docker?

You're right that it can't easily, and the current approach is reasonable. Here's why:
  • The agent runs inside tmux via tee to /tmp/agent_output.log (abstract_installed_agent.py:225), which captures the raw terminal output including both JSONL and shell noise.
  • The harness captures the full tmux pane content after the agent finishes (harness.py:756), splits on a delimiter, and writes the post-agent portion as agent.log.
  • ClaudeCodeLogFormatter.extract_jsonl_content() then strips non-JSON lines from this mixed file.

The fundamental problem is that tmux pane capture is the mechanism for getting data out of the container — you can't easily have the container write a clean JSONL file that's accessible outside. You'd need either a volume mount for the log file or a docker cp step. Both are possible but add complexity. The current "capture pane, strip non-JSON" approach is pragmatic.

One improvement: The agent could also write --output-file /tmp/agent.jsonl (if Claude CLI supports it) and then docker cp that file out. This would give you a perfectly clean JSONL without needing extract_jsonl_content(). But this is an optimization, not a bug fix.
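The "strip non-JSON" step can be sketched like this — a simplified stand-in for what extract_jsonl_content() does, not the real implementation (which also handles synthetic prompt injection):

```python
import json


def extract_jsonl_lines(mixed_log: str) -> list[dict]:
    """Pull JSON event objects out of captured tmux pane output.

    Simplified sketch: keeps only lines that parse as JSON objects,
    discarding shell prompts, tee echoes, and truncated lines.
    """
    events = []
    for line in mixed_log.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # obvious terminal noise, skip without parsing
        try:
            events.append(json.loads(line))
        except json.JSONDecodeError:
            continue  # partial or wrapped lines are dropped
    return events
```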

Decision: ⌛ Ignore trying to generate separate jsonl file for now

3. Are the __init__ arguments on AbstractInstalledAgent appropriate?

Three explicit params + kwargs is fine. The real concern is the overall flow from harness to agent:

The signature is:

def __init__(self, model_name: str | None = None, allowed_tools: list[str] | None = None, mcp_servers: dict[str, McpServerConfig] | None = None, **kwargs)
  • The harness stuffs allowed_tools and mcp_servers into agent_kwargs dict (harness.py:188-192), which flows through AgentFactory.get_agent(**kwargs) and lands in AbstractInstalledAgent.__init__(**kwargs).
  • This kwargs-passing pattern means there's no type checking at the factory boundary. If you typo allowed_tool instead of allowed_tools, it silently gets absorbed by **kwargs somewhere in the chain.

Recommendation: The __init__ arg count is fine. The more pressing issue is the loose **kwargs pipeline. A config dataclass (e.g., AgentConfig with model_name, allowed_tools, mcp_servers) would make the interface explicit and catch typos at construction time. But that's a broader refactor, not specific to this PR.

Decision: ⌛ Ignore refactor for now

4. Why is ade_bench/models only present for PluginSet and init?

This is a divergence from the existing pattern.

The codebase defines its other Pydantic models in:

  • harness_models.py: FailureMode, RunMetadata, TrialResults, BenchmarkResults, VariantConfig, TaskMetadata, TerminalCommand (8 models in one file)
  • agents/base_agent.py: AgentResult
  • handlers/trial_handler.py: Task, TaskPrompt, TaskDifficulty

None of these use a models/ package. The new ade_bench/models/plugin_set.py is the only thing in ade_bench/models/.

This isn't wrong — arguably it's the start of a better pattern (dedicated models package). But it does feel inconsistent. Either:

  • Move the plugin set models into harness_models.py (consistent with current convention), or
  • Commit to the models/ package and plan to migrate other models there over time.

I'd lean toward keeping models/ as the beginning of a migration, but add a comment in models/__init__.py noting the intent.

Decision: ❌ Stick with existing pattern instead of migrating

5. Will the plugin set concept expand cleanly for LSP, hooks, agents, etc.?

Partially

Looking at the Claude Code plugin structure, a full plugin can contain:

| Component | Current PluginSet support | Gap |
| --- | --- | --- |
| Skills (skills/) | Yes — SkillOrigin | None |
| MCP servers (.mcp.json) | Yes — McpServerConfig | None |
| Commands (commands/) | No | Would need adding |
| Agents (agents/) | No | Would need adding |
| Hooks (hooks/hooks.json) | No | Would need adding |
| LSP servers (.lsp.json) | No | Would need adding |

The PluginSet model is currently a flat configuration object, not a representation of a Claude Code plugin. It works well for the current "skills + MCP" scope. To support the full plugin spec, you'd likely want to shift from the current "list of individual skills + list of MCP servers" model to something closer to "list of plugin directories to install," since that's how Claude Code actually consumes them.

Recommendation: For now this is fine — you're testing skills and MCP. If/when you need hooks or LSP, consider whether the PluginSet should evolve into a PluginConfig that points to actual plugin directories (which naturally support all component types) rather than enumerating each component type separately.
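If PluginSet did evolve that way, the YAML might look something like this — a purely hypothetical shape, not the current plugin-sets.yaml schema:

```yaml
# Hypothetical future shape: point at plugin directories instead of
# enumerating skills and MCP servers separately.
plugin_sets:
  dbt-full-plugin:
    agents: [claude]
    plugin_dirs:
      - plugins/dbt-skills   # skills/, commands/, hooks/ resolved by the agent
      - plugins/dbt-mcp      # .mcp.json picked up on install
```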

Decision: ✅ Leave as-is for now, consider different config in future

6. Are there assumptions about artifact shape that are no longer true?

Yes, a few:
  1. generate_results_html.py:320-328: Hardcodes sessions/transcript/page-001.html. If claude-code-transcripts changes its output format (e.g., produces index.html instead), the HTML pipeline breaks silently (no transcript shown, falls back to raw text). The ClaudeCodeLogFormatter already handles both index.html and page-001.html (log_formatter.py:329-336), but the HTML generator only checks for page-001.html.

  2. summarize_results.py and TSV output: The plugin_set column in TSV replaces the old used_mcp column. The analyze.py migration script references plugin_set in its column order. If you have old experiment results with used_mcp, the migration path in analyze.py should handle this — worth verifying.

  3. Tool tags in HTML: The HTML summary now renders tools_used as <span class="tool-tag"> elements. This assumes tools_used is a flat list of strings. The skill: prefix convention (e.g., skill:using-dbt-for-analytics-engineering) is a display concern baked into the data model — if you later want to distinguish tool types in the UI, you'd need to parse this prefix back out.

  4. _GENERIC_TOOLS filter (claude_code_agent.py:82-86): This is Claude-specific. If Gemini or Codex agents get tool extraction, they'd need their own filter lists. Currently those agents don't implement extract_tools_used() or format_agent_log(), so it's fine for now.
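On point 3 above: parsing the skill: display prefix back out is cheap if it's ever needed in the UI. This helper is hypothetical, not existing code:

```python
def split_tool_tag(tag: str) -> tuple[str, str]:
    """Split a tools_used entry into (kind, name).

    "skill:using-dbt" -> ("skill", "using-dbt"); plain names get kind "tool".
    """
    if ":" in tag:
        kind, name = tag.split(":", 1)
        return kind, name
    return "tool", tag
```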

Decision: ✅ None of these are blockers

7. Is there a better way to handle prompt_suffix?

The current approach loses information

Here's the flow:

  1. PluginSet.prompt_suffix is defined in YAML
  2. In _run_agent() (harness.py:501-503), it's extracted from the current plugin set
  3. In _run_agent_with_timeout() (harness.py:474-475), it's concatenated: full_prompt = f"{task_prompt}\n\n{prompt_suffix}"
  4. The agent receives only full_prompt — it has no knowledge of what was original vs. appended
  5. TrialResults.task_prompt (harness.py:653) stores the original task prompt (before suffix)

What's missing: The suffix and the combined prompt are not persisted anywhere. If you look at results later, you can infer the suffix from plugin_set_name, but you can't see the exact text that was sent.

Recommendation: Add prompt_suffix and/or full_prompt fields to TrialResults. This gives you full traceability: original prompt, suffix, and what was actually sent. The cost is a small amount of data duplication but the debugging value is high.

Decision: ✅ Store prompt_suffix in results if present, alongside original prompt

8. Are there READMEs or documentation that are now out of date?

Yes
  1. CLAUDE.md (this project's instructions): Still references --use-mcp implicitly through the "Key Parameters" section. The --plugin-set flag should be documented there.

  2. README.md (main project): Contains usage examples that still show --use-mcp (lines 115, 162, 400-407) and don't mention --plugin-set.

  3. shared/scripts/setup-dbt-mcp.sh: This file was deleted in the PR (the old MCP setup script). If any documentation references it, those references are stale.

Decision: ✅ Update docs

9. Are the pytests appropriate?

Good coverage of the new models and plugin loading, but significant gaps in the new behavior:

Well-tested:

  • test_plugin_set.py: Pydantic model construction, install_all(), agent compatibility, YAML deserialization, lookup methods. Solid.
  • test_loader.py: Happy path loading, missing files, default resolution, agent compatibility filtering. Good.
  • test_skills_handler.py: No-op for empty skills, install-all vs. specific skills, multiple origins, failure resilience. Good.

Not tested at all:

  • _configure_mcp_servers() — the entire MCP setup path (env file writing, claude mcp add command construction, dynamic dbt env discovery). This is complex code with shell commands and would benefit from at least mocked tests.
  • extract_tools_used() — tool extraction with _GENERIC_TOOLS filtering and Skill -> skill:name expansion. Easy to unit test with fixture JSONL data.
  • ClaudeCodeLogFormatter.extract_jsonl_content() — the JSONL extraction logic including synthetic prompt injection. This is non-trivial parsing logic.
  • generate_html_transcript() — can be tested with a mock of claude_code_transcripts.
  • Prompt suffix injection — no test verifying that _run_agent_with_timeout correctly concatenates the suffix.
  • TSV writer changes — the tools and plugin_set columns.
  • Harness multi-plugin-set iteration — the __ suffixed run IDs and separate output directories.

Recommendation: Prioritize tests for extract_jsonl_content() (parsing logic), extract_tools_used() (easy wins), and _configure_mcp_servers() (complex shell interaction). These are the areas most likely to regress.
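A unit test for the Skill expansion and generic-tool filtering could look roughly like this; the stand-in extract_tools_used and GENERIC_TOOLS are simplified assumptions about the real code, included only to show the fixture-JSONL testing pattern:

```python
import json

GENERIC_TOOLS = {"Bash", "Read", "Edit"}  # stand-in for _GENERIC_TOOLS


def extract_tools_used(jsonl_text: str) -> list[str]:
    """Minimal stand-in: collect tool_use names, expanding Skill invocations."""
    tools = []
    for line in jsonl_text.splitlines():
        event = json.loads(line)
        if event.get("type") != "tool_use":
            continue
        name = event["name"]
        if name == "Skill":
            tools.append(f"skill:{event['input']['skill_name']}")
        elif name not in GENERIC_TOOLS:
            tools.append(name)
    return tools


def test_skill_expansion_and_generic_filtering():
    fixture = "\n".join([
        json.dumps({"type": "tool_use", "name": "Bash"}),  # filtered out
        json.dumps({"type": "tool_use", "name": "Skill",
                    "input": {"skill_name": "using-dbt"}}),
        json.dumps({"type": "text", "text": "thinking..."}),  # not a tool_use
    ])
    assert extract_tools_used(fixture) == ["skill:using-dbt"]
```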

Decision: 🤏 Add tests for _configure_mcp_servers only

Additional Observations

_configure_mcp_servers is coupled to Claude CLI

abstract_installed_agent.py:96 uses self.NAME.value (e.g., "claude") as the CLI binary name for mcp add. This assumes all installed agents use the same <agent_name> mcp add command syntax. Gemini CLI doesn't support this — it has no MCP integration via CLI. The agents filter on PluginSet handles this at the config level, but the code in AbstractInstalledAgent would still try to run gemini mcp add if someone misconfigures a plugin set. Consider either:

  • Moving _configure_mcp_servers() to ClaudeCodeAgent specifically, or
  • Adding a guard that checks whether the agent supports MCP before calling it.
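The guard option could be sketched as follows; SUPPORTS_MCP_CLI and AgentMcpError are hypothetical names, and the assumption that only Claude Code has an `<agent> mcp add` command comes from the observation above:

```python
class AgentMcpError(RuntimeError):
    """Raised when a plugin set assigns MCP servers to an agent without MCP CLI support."""


SUPPORTS_MCP_CLI = {"claude"}  # assumption: only Claude Code exposes `<agent> mcp add`


def configure_mcp_servers(agent_name: str, servers: dict) -> list[str]:
    """Build `mcp add` commands, failing fast for unsupported agents.

    Without the guard, a misconfigured plugin set would silently try to
    run e.g. `gemini mcp add` and fail inside the container.
    """
    if not servers:
        return []
    if agent_name not in SUPPORTS_MCP_CLI:
        raise AgentMcpError(
            f"{agent_name} has no MCP CLI; fix the plugin set's agents filter"
        )
    return [f"{agent_name} mcp add {name}" for name in servers]
```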

Decision: ❌ Do nothing

hasattr(task_agent, '_log_formatter') is a code smell

harness.py:763 uses hasattr to duck-type check for log formatter support. This is fragile — it checks for a private attribute. A cleaner approach would be an optional method on BaseAgent (e.g., supports_log_formatting() -> bool) or making format_agent_log() and extract_tools_used() part of the BaseAgent interface with default no-op implementations. (Note: these methods already exist on BaseAgent with no-op defaults — the hasattr check is unnecessary.)
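The no-op-default approach in miniature; class names mirror the codebase but the bodies are illustrative, not the real implementations:

```python
class BaseAgent:
    """Interface with no-op defaults, so callers never need hasattr checks."""

    def format_agent_log(self, raw_log: str) -> str:
        return raw_log  # default: pass the log through unchanged

    def extract_tools_used(self, raw_log: str) -> list[str]:
        return []  # default: agents without log parsing report no tools


class ClaudeCodeAgent(BaseAgent):
    def extract_tools_used(self, raw_log: str) -> list[str]:
        # the real agent parses JSONL; trivially overridden here for illustration
        return ["skill:example"]


def collect_tools(agent: BaseAgent, raw_log: str) -> list[str]:
    # no hasattr duck-typing: every agent is guaranteed to implement the method
    return agent.extract_tools_used(raw_log)
```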

Decision: ✅ Fix to use existing no-op

The --env-file flag in MCP add command

abstract_installed_agent.py:133: The constructed command puts --env-file between the MCP command and its args. This works for uvx but depends on the runner not consuming the flag itself. If uvx interprets --env-file itself rather than passing it to the subprocess, this could break. Worth verifying if other MCP commands are added.

Decision: ❌ Leave as-is and assume uv will continue to work as it does

joellabes and others added 3 commits February 9, 2026 16:13
- Move plugin set models from ade_bench/models/ to harness_models.py
  (consistent with existing convention)
- Change default plugin set to 'none' only (avoid double-run footgun)
- Persist prompt_suffix in TrialResults and TSV output for traceability
- Replace hasattr(_log_formatter) duck-typing with BaseAgent interface
  methods (format_agent_log, extract_tools_used already have no-op defaults)
- Standardize transcript output to single sessions/transcript.html file
  instead of agent-specific directory structure
- Update CLAUDE.md and README.md: replace --use-mcp with --plugin-set
- Add unit tests for _configure_mcp_servers (8 test cases)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@joellabes joellabes marked this pull request as ready for review February 9, 2026 03:23
@joellabes
Collaborator Author

@bstancil Do you want to give this a gander? Finally ready to rumble

@clkao
Contributor

clkao commented Feb 10, 2026

I think we should include the plugin-set in the container image names, so runs with different sets won't conflict.

Strong agree! Discussed further in #96 (comment) but I will approach that in a different PR because this one is already a monster.

created #118

@joellabes joellabes changed the title Experimenting with Claude skills Plugin system (groups of skills, MCP servers) Feb 12, 2026
joellabes and others added 5 commits February 23, 2026 13:35
Resolve conflicts from black formatting and ruff lint changes on main,
keeping plugin set system from this branch (replacing use_mcp boolean).
Also incorporate --log-level CLI option added on main.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@bstancil
Collaborator

Very in favor of this: MCP was the hotness when this got built, and now skills are definitely the new hotness instead. (And honestly, I'm not sure how much longer MCP will really matter: https://trends.google.com/explore?q=claude%2520skills%2Cclaude%2520mcp&date=today%205-y&geo=Worldwide)

Also, on these things, two more general points:

  1. Y'all should definitely feel free to move things forward without me. Happy to look if I need to, but feel free to own this, and be the ultimate deciders.
  2. Along those lines, if there are broader rewrites of things you (or Claude) want to do, by all means. Most of this code was written 6 months ago with like, Cursor, line-by-line reviews, and Opus 4 or something. So it's entirely possible that Claude Code + Opus 4.6 will want to do things in very different ways, and if it does, who am I to disagree?
