
Plugin system (groups of skills, MCP servers)#97

Open
joellabes wants to merge 51 commits into main from joellabes/add-claude-skill

Conversation

@joellabes
Collaborator

I keep seeing the LLMs make dumb assumptions about dbt's behaviour, so wanted to try using Skills to give it extra context.

In theory, Claude automatically recognises when a skill would be appropriate. In practice, it doesn't unless you yell at it 🤦. So we shouldn't merge this as-is.

@jasnonaz
Collaborator

This is great - but we should break the skills out so they're not baked into the Claude agent, which is how we test baseline / unaugmented Claude performance. We should have some mechanism to test each agent with and without skills.

@bstancil
Collaborator

bstancil commented Dec 15, 2025

@joellabes @jasnonaz - @clkao had an earlier PR for a concept of plugins that I believe incorporated skills. I don't have strong opinions about these, though I do bias a bit towards more general concepts so that they could be extended to other types of products (eg, if gemini adds something like skills, etc). It might be worth looking at that implementation and seeing if there are generalizations for this: #42

@joellabes
Collaborator Author

Oh yeah for sure! Would happily adopt @clkao's setup instead - I'll just keep noodling with the skill text in the meantime

@bstancil
Collaborator

Cool, @clkao , you have any thoughts on this?

joellabes and others added 18 commits January 28, 2026 10:53
Replaces --use-mcp and --use-skills flags with YAML-configured
plugin sets for A/B comparison of agent performance.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
12 tasks with TDD approach, exact file paths, and code snippets.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Defines four skill sets:
- no-plugins: baseline without skills or MCP (default)
- dbt-skills: Claude-only, installs dbt agent skills
- dbt-mcp: configures dbt MCP server (default)
- dbt-full: both skills and MCP for Claude

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace used_mcp field with skill_set_name, skill_set_skills,
and skill_set_mcp_servers for richer experiment tracking.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The new --plugin-set flag accepts skill set names from skill-sets.yaml.
When not specified, uses all default skill sets for A/B comparison.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Replace use_mcp/use_skills params with plugin_set_names
- Add _init_skill_sets() to load skill sets from YAML config
- Refactor run() to iterate over skill sets with suffixed run IDs
- Move existing run logic to _execute_trials()
- Add skill set metadata (name, skills, mcp_servers) to TrialResults
- Derive use_skills from current skill set configuration

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Replace use_skills parameter with skill_set
- Add SkillsHandler and McpHandler imports
- Call handlers to install skills and configure MCP servers
- Remove obsolete _install_skills_via_cli from agent_setup.py
- Update harness to pass skill_set to orchestrator

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Logic has moved to McpHandler in ade_bench/plugins/mcp_handler.py

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Rename all "skill set" terminology to "plugin set" for clarity:
- SkillSet -> PluginSet
- SkillSetsConfig -> PluginSetsConfig
- SkillSetLoader -> PluginSetLoader
- skill-sets.yaml -> plugin-sets.yaml
- Related variables, methods, and test names updated

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The used_mcp field was removed when implementing the plugin set system.
Replace with plugin_set_name to fix the AttributeError in CI.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
joellabes and others added 4 commits February 4, 2026 07:56
The dbt-mcp server requires DBT_PROJECT_DIR and DBT_PATH environment
variables to locate the dbt project and executable. The plugin-based
MCP configuration was missing these dynamic values that the old
setup-dbt-mcp.sh script set.

Changes:
- Add _get_dbt_dynamic_env() to query container for dbt path
- Set DBT_PROJECT_DIR to container app directory (/app)
- Set DBT_PATH from `which dbt` in container
- Set DISABLE_DBT_CLI=false to enable dbt CLI in dbt-mcp
- Detect dbt MCP servers by name or dbt-mcp in args
- Merge dynamic vars with static config from plugin-sets.yaml

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@joellabes joellabes marked this pull request as ready for review February 5, 2026 03:20
@joellabes joellabes marked this pull request as draft February 6, 2026 01:49
@clkao
Contributor

clkao commented Feb 6, 2026

Hi @joellabes, I am finally back at this. I will take a look at this PR. Is this skill-specific, or does it also provide some hooks for plugin authors to enable MCP and other stuff?

@joellabes
Collaborator Author

joellabes commented Feb 6, 2026

@clkao it started as just skills but is now more all-purpose (it can do MCP and skills today, and I believe it's extensible to LSPs, subagents, and anything else)

I think it's a spiritual blend of your PR plus @b-per's evals config for the dbt-agent-skills repo

@clkao
Contributor

clkao commented Feb 6, 2026

@joellabes great! I noticed you use superpowers as well! ✨ Did you mean to merge the plans too?

Another Q: are there ways to extend plugin-sets.yaml without modifying it? I'd imagine one fiddling with some local config, since this can also customize prompts

@clkao
Contributor

clkao commented Feb 6, 2026

I tested the branch with --plugin-set all-dbt-skills with the @demo set and it worked. I can see one task (airbnb005) loading the relevant skill, but for some reason no skills are triggered for analytics_engineering002.medium, f1003.hard and intercom002

@clkao
Contributor

clkao commented Feb 6, 2026

I think we should include the plugin-set in the container image names, so runs with different sets won't conflict.

@joellabes
Collaborator Author

Did you mean to merge the plans too?

I have kept them around during development and review (they turned out to be useful, MCP wasn't working and a new Claude instance noticed that past-Claude had not implemented all steps of the plan)

But I do not intend to keep them around post-merge.

Another Q: are there ways to extend plugin-sets.yaml without modifying it? I'd imagine one fiddling with some local config, since this can also customize prompts

You need to modify the YAML file to add new plugin-sets, but you don't have to commit them. I think I'm missing some context behind your question - why is extending plugin sets without modifying the file a valuable goal?

I tested the branch with --plugin-set all-dbt-skills with the @demo set and it worked. I can see one task (airbnb005) loading the relevant skill, but for some reason no skills are triggered for analytics_engineering002.medium, f1003.hard and intercom002

stochasticism, amirite? When I ran with the dbt-for-ae plugin set, all three of those tasks successfully loaded the skill. So I think some of this is bad luck and the kind of thing we're crossing our fingers for Anthropic to improve

I think we should include the plugin-set in the container image names, so runs with different sets won't conflict.

Strong agree! Discussed further in #96 (comment) but I will approach that in a different PR because this one is already a monster.

@joellabes joellabes requested a review from Copilot February 9, 2026 02:08


@joellabes
Collaborator Author

My questions while scrolling through the diff, and Claude's responses

1. Could transcript generation be more generalised for other LLMs?

Yes, and this is worth addressing. The current approach has two layers of coupling:
  • claude-code-transcripts library: Produces multiple HTML files (page-001.html, etc.) in a directory. The generate_results_html.py script hardcodes a check for sessions/transcript/page-001.html and copies the entire transcript/ directory.

  • The log formatter hierarchy is already partially generalized: LogFormatter is abstract with a default no-op generate_html_transcript(). But the HTML results pipeline (generate_results_html.py:318-328) assumes the Claude-specific transcript directory structure.

Recommendation: Define a contract where any agent's transcript is a single HTML file at a well-known path (e.g., sessions/transcript.html). The generate_html_transcript() method would be responsible for producing that single file — the Claude implementation would call claude-code-transcripts internally and then consolidate/inline the output. This decouples the results pipeline from any library's output structure. Other agents (Gemini, Codex) could produce their own single-file transcripts using whatever tools they have.
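The consolidation step in that contract can be sketched as follows; the function name and the page-001.html naming are assumptions based on the description above, not the real pipeline code:

```python
from pathlib import Path


def consolidate_transcript(transcript_dir: Path, output_file: Path) -> bool:
    """Merge paginated transcript pages into a single well-known HTML file.

    Hypothetical sketch: assumes the library emits page-001.html, page-002.html, ...
    Returns False when there is nothing to consolidate, so callers can fall back.
    """
    pages = sorted(transcript_dir.glob("page-*.html"))
    if not pages:
        return False
    # Concatenate pages in order; a real implementation might also inline assets.
    merged = "\n".join(p.read_text(encoding="utf-8") for p in pages)
    output_file.write_text(merged, encoding="utf-8")
    return True
```

With this shape, generate_results_html.py only ever checks for the single sessions/transcript.html path, regardless of which agent produced it.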

Decision: ✅ Change to standard well-known html file location

2. Could log extraction happen earlier so JSONL is the only thing that comes back from Docker?

You're right that it can't easily, and the current approach is reasonable. Here's why:
  • The agent runs inside tmux via tee to /tmp/agent_output.log (abstract_installed_agent.py:225), which captures the raw terminal output including both JSONL and shell noise.
  • The harness captures the full tmux pane content after the agent finishes (harness.py:756), splits on a delimiter, and writes the post-agent portion as agent.log.
  • ClaudeCodeLogFormatter.extract_jsonl_content() then strips non-JSON lines from this mixed file.

The fundamental problem is that tmux pane capture is the mechanism for getting data out of the container — you can't easily have the container write a clean JSONL file that's accessible outside. You'd need either a volume mount for the log file or a docker cp step. Both are possible but add complexity. The current "capture pane, strip non-JSON" approach is pragmatic.

One improvement: The agent could also write --output-file /tmp/agent.jsonl (if Claude CLI supports it) and then docker cp that file out. This would give you a perfectly clean JSONL without needing extract_jsonl_content(). But this is an optimization, not a bug fix.
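The "strip non-JSON" step can be sketched like this — a simplified stand-in for what extract_jsonl_content() does, not the real implementation (which also handles synthetic prompt injection):

```python
import json


def extract_jsonl_lines(mixed_log: str) -> list[dict]:
    """Pull JSON event objects out of captured tmux pane output.

    Simplified sketch: keeps only lines that parse as JSON objects,
    discarding shell prompts, tee echoes, and truncated lines.
    """
    events = []
    for line in mixed_log.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # obvious terminal noise, skip without parsing
        try:
            events.append(json.loads(line))
        except json.JSONDecodeError:
            continue  # partial or wrapped lines are dropped
    return events
```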

Decision: ⌛ Ignore trying to generate separate jsonl file for now

3. Are the __init__ arguments on AbstractInstalledAgent appropriate?

Three explicit params + kwargs is fine. The real concern is the overall flow from harness to agent:

The signature is:

def __init__(self, model_name: str | None = None, allowed_tools: list[str] | None = None, mcp_servers: dict[str, McpServerConfig] | None = None, **kwargs)
  • The harness stuffs allowed_tools and mcp_servers into agent_kwargs dict (harness.py:188-192), which flows through AgentFactory.get_agent(**kwargs) and lands in AbstractInstalledAgent.__init__(**kwargs).
  • This kwargs-passing pattern means there's no type checking at the factory boundary. If you typo allowed_tool instead of allowed_tools, it silently gets absorbed by **kwargs somewhere in the chain.

Recommendation: The __init__ arg count is fine. The more pressing issue is the loose **kwargs pipeline. A config dataclass (e.g., AgentConfig with model_name, allowed_tools, mcp_servers) would make the interface explicit and catch typos at construction time. But that's a broader refactor, not specific to this PR.

Decision: ⌛ Ignore refactor for now

4. Why is ade_bench/models only present for PluginSet and init?

This is a divergence from the existing pattern.

The codebase defines its other Pydantic models in:

  • harness_models.py: FailureMode, RunMetadata, TrialResults, BenchmarkResults, VariantConfig, TaskMetadata, TerminalCommand (8 models in one file)
  • agents/base_agent.py: AgentResult
  • handlers/trial_handler.py: Task, TaskPrompt, TaskDifficulty

None of these use a models/ package. The new ade_bench/models/plugin_set.py is the only thing in ade_bench/models/.

This isn't wrong — arguably it's the start of a better pattern (dedicated models package). But it does feel inconsistent. Either:

  • Move the plugin set models into harness_models.py (consistent with current convention), or
  • Commit to the models/ package and plan to migrate other models there over time.

I'd lean toward keeping models/ as the beginning of a migration, but add a comment in models/__init__.py noting the intent.

Decision: ❌ Stick with existing pattern instead of migrating

5. Will the plugin set concept expand cleanly for LSP, hooks, agents, etc.?

Partially

Looking at the Claude Code plugin structure, a full plugin can contain:

| Component | Current PluginSet support | Gap |
| --- | --- | --- |
| Skills (skills/) | Yes — SkillOrigin | None |
| MCP servers (.mcp.json) | Yes — McpServerConfig | None |
| Commands (commands/) | No | Would need adding |
| Agents (agents/) | No | Would need adding |
| Hooks (hooks/hooks.json) | No | Would need adding |
| LSP servers (.lsp.json) | No | Would need adding |

The PluginSet model is currently a flat configuration object, not a representation of a Claude Code plugin. It works well for the current "skills + MCP" scope. To support the full plugin spec, you'd likely want to shift from the current "list of individual skills + list of MCP servers" model to something closer to "list of plugin directories to install," since that's how Claude Code actually consumes them.

Recommendation: For now this is fine — you're testing skills and MCP. If/when you need hooks or LSP, consider whether the PluginSet should evolve into a PluginConfig that points to actual plugin directories (which naturally support all component types) rather than enumerating each component type separately.
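If PluginSet did evolve that way, the YAML might look something like this — a purely hypothetical shape, not the current plugin-sets.yaml schema:

```yaml
# Hypothetical future shape: point at plugin directories instead of
# enumerating skills and MCP servers separately.
plugin_sets:
  dbt-full-plugin:
    agents: [claude]
    plugin_dirs:
      - plugins/dbt-skills   # skills/, commands/, hooks/ resolved by the agent
      - plugins/dbt-mcp      # .mcp.json picked up on install
```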

Decision: ✅ Leave as-is for now, consider different config in future

6. Are there assumptions about artifact shape that are no longer true?

Yes, a few:
  1. generate_results_html.py:320-328: Hardcodes sessions/transcript/page-001.html. If claude-code-transcripts changes its output format (e.g., produces index.html instead), the HTML pipeline breaks silently (no transcript shown, falls back to raw text). The ClaudeCodeLogFormatter already handles both index.html and page-001.html (log_formatter.py:329-336), but the HTML generator only checks for page-001.html.

  2. summarize_results.py and TSV output: The plugin_set column in TSV replaces the old used_mcp column. The analyze.py migration script references plugin_set in its column order. If you have old experiment results with used_mcp, the migration path in analyze.py should handle this — worth verifying.

  3. Tool tags in HTML: The HTML summary now renders tools_used as <span class="tool-tag"> elements. This assumes tools_used is a flat list of strings. The skill: prefix convention (e.g., skill:using-dbt-for-analytics-engineering) is a display concern baked into the data model — if you later want to distinguish tool types in the UI, you'd need to parse this prefix back out.

  4. _GENERIC_TOOLS filter (claude_code_agent.py:82-86): This is Claude-specific. If Gemini or Codex agents get tool extraction, they'd need their own filter lists. Currently those agents don't implement extract_tools_used() or format_agent_log(), so it's fine for now.
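On point 3 above: parsing the skill: display prefix back out is cheap if it's ever needed in the UI. This helper is hypothetical, not existing code:

```python
def split_tool_tag(tag: str) -> tuple[str, str]:
    """Split a tools_used entry into (kind, name).

    "skill:using-dbt" -> ("skill", "using-dbt"); plain names get kind "tool".
    """
    if ":" in tag:
        kind, name = tag.split(":", 1)
        return kind, name
    return "tool", tag
```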

Decision: ✅ None of these are blockers

7. Is there a better way to handle prompt_suffix?

The current approach loses information

Here's the flow:

  1. PluginSet.prompt_suffix is defined in YAML
  2. In _run_agent() (harness.py:501-503), it's extracted from the current plugin set
  3. In _run_agent_with_timeout() (harness.py:474-475), it's concatenated: full_prompt = f"{task_prompt}\n\n{prompt_suffix}"
  4. The agent receives only full_prompt — it has no knowledge of what was original vs. appended
  5. TrialResults.task_prompt (harness.py:653) stores the original task prompt (before suffix)

What's missing: The suffix and the combined prompt are not persisted anywhere. If you look at results later, you can infer the suffix from plugin_set_name, but you can't see the exact text that was sent.

Recommendation: Add prompt_suffix and/or full_prompt fields to TrialResults. This gives you full traceability: original prompt, suffix, and what was actually sent. The cost is a small amount of data duplication but the debugging value is high.

Decision: ✅ Store prompt_suffix in results if present, alongside original prompt

8. Are there READMEs or documentation that are now out of date?

Yes
  1. CLAUDE.md (this project's instructions): Still references --use-mcp implicitly through the "Key Parameters" section. The --plugin-set flag should be documented there.

  2. README.md (main project): Contains usage examples that still show --use-mcp (lines 115, 162, 400-407) and don't mention --plugin-set.

  3. shared/scripts/setup-dbt-mcp.sh: This file was deleted in the PR (the old MCP setup script). If any documentation references it, those references are stale.

Decision: ✅ Update docs

9. Are the pytests appropriate?

Good coverage of the new models and plugin loading, but significant gaps in the new behavior:

Well-tested:

  • test_plugin_set.py: Pydantic model construction, install_all(), agent compatibility, YAML deserialization, lookup methods. Solid.
  • test_loader.py: Happy path loading, missing files, default resolution, agent compatibility filtering. Good.
  • test_skills_handler.py: No-op for empty skills, install-all vs. specific skills, multiple origins, failure resilience. Good.

Not tested at all:

  • _configure_mcp_servers() — the entire MCP setup path (env file writing, claude mcp add command construction, dynamic dbt env discovery). This is complex code with shell commands and would benefit from at least mocked tests.
  • extract_tools_used() — tool extraction with _GENERIC_TOOLS filtering and Skill -> skill:name expansion. Easy to unit test with fixture JSONL data.
  • ClaudeCodeLogFormatter.extract_jsonl_content() — the JSONL extraction logic including synthetic prompt injection. This is non-trivial parsing logic.
  • generate_html_transcript() — can be tested with a mock of claude_code_transcripts.
  • Prompt suffix injection — no test verifying that _run_agent_with_timeout correctly concatenates the suffix.
  • TSV writer changes — the tools and plugin_set columns.
  • Harness multi-plugin-set iteration — the __ suffixed run IDs and separate output directories.

Recommendation: Prioritize tests for extract_jsonl_content() (parsing logic), extract_tools_used() (easy wins), and _configure_mcp_servers() (complex shell interaction). These are the areas most likely to regress.
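A unit test for the Skill expansion and generic-tool filtering could look roughly like this; the stand-in extract_tools_used and GENERIC_TOOLS are simplified assumptions about the real code, included only to show the fixture-JSONL testing pattern:

```python
import json

GENERIC_TOOLS = {"Bash", "Read", "Edit"}  # stand-in for _GENERIC_TOOLS


def extract_tools_used(jsonl_text: str) -> list[str]:
    """Minimal stand-in: collect tool_use names, expanding Skill invocations."""
    tools = []
    for line in jsonl_text.splitlines():
        event = json.loads(line)
        if event.get("type") != "tool_use":
            continue
        name = event["name"]
        if name == "Skill":
            tools.append(f"skill:{event['input']['skill_name']}")
        elif name not in GENERIC_TOOLS:
            tools.append(name)
    return tools


def test_skill_expansion_and_generic_filtering():
    fixture = "\n".join([
        json.dumps({"type": "tool_use", "name": "Bash"}),  # filtered out
        json.dumps({"type": "tool_use", "name": "Skill",
                    "input": {"skill_name": "using-dbt"}}),
        json.dumps({"type": "text", "text": "thinking..."}),  # not a tool_use
    ])
    assert extract_tools_used(fixture) == ["skill:using-dbt"]
```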

Decision: 🤏 Add tests for _configure_mcp_servers only

Additional Observations

_configure_mcp_servers is coupled to Claude CLI

abstract_installed_agent.py:96 uses self.NAME.value (e.g., "claude") as the CLI binary name for mcp add. This assumes all installed agents use the same <agent_name> mcp add command syntax. Gemini CLI doesn't support this — it has no MCP integration via CLI. The agents filter on PluginSet handles this at the config level, but the code in AbstractInstalledAgent would still try to run gemini mcp add if someone misconfigures a plugin set. Consider either:

  • Moving _configure_mcp_servers() to ClaudeCodeAgent specifically, or
  • Adding a guard that checks whether the agent supports MCP before calling it.
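The guard option could be sketched as follows; SUPPORTS_MCP_CLI and AgentMcpError are hypothetical names, and the assumption that only Claude Code has an `<agent> mcp add` command comes from the observation above:

```python
class AgentMcpError(RuntimeError):
    """Raised when a plugin set assigns MCP servers to an agent without MCP CLI support."""


SUPPORTS_MCP_CLI = {"claude"}  # assumption: only Claude Code exposes `<agent> mcp add`


def configure_mcp_servers(agent_name: str, servers: dict) -> list[str]:
    """Build `mcp add` commands, failing fast for unsupported agents.

    Without the guard, a misconfigured plugin set would silently try to
    run e.g. `gemini mcp add` and fail inside the container.
    """
    if not servers:
        return []
    if agent_name not in SUPPORTS_MCP_CLI:
        raise AgentMcpError(
            f"{agent_name} has no MCP CLI; fix the plugin set's agents filter"
        )
    return [f"{agent_name} mcp add {name}" for name in servers]
```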

Decision: ❌ Do nothing

hasattr(task_agent, '_log_formatter') is a code smell

harness.py:763 uses hasattr to duck-type check for log formatter support. This is fragile — it checks for a private attribute. A cleaner approach would be an optional method on BaseAgent (e.g., supports_log_formatting() -> bool) or making format_agent_log() and extract_tools_used() part of the BaseAgent interface with default no-op implementations. (Note: these methods already exist on BaseAgent with no-op defaults — the hasattr check is unnecessary.)
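The no-op-default approach in miniature; class names mirror the codebase but the bodies are illustrative, not the real implementations:

```python
class BaseAgent:
    """Interface with no-op defaults, so callers never need hasattr checks."""

    def format_agent_log(self, raw_log: str) -> str:
        return raw_log  # default: pass the log through unchanged

    def extract_tools_used(self, raw_log: str) -> list[str]:
        return []  # default: agents without log parsing report no tools


class ClaudeCodeAgent(BaseAgent):
    def extract_tools_used(self, raw_log: str) -> list[str]:
        # the real agent parses JSONL; trivially overridden here for illustration
        return ["skill:example"]


def collect_tools(agent: BaseAgent, raw_log: str) -> list[str]:
    # no hasattr duck-typing: every agent is guaranteed to implement the method
    return agent.extract_tools_used(raw_log)
```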

Decision: ✅ Fix to use existing no-op

The --env-file flag in MCP add command

abstract_installed_agent.py:133: The constructed command puts --env-file between the MCP command and its args. This works for uvx but depends on the runner not consuming the flag itself. If uvx interprets --env-file itself rather than passing it to the subprocess, this could break. Worth verifying if other MCP commands are added.

Decision: ❌ Leave as-is and assume uv will continue to work as it does

joellabes and others added 3 commits February 9, 2026 16:13
- Move plugin set models from ade_bench/models/ to harness_models.py
  (consistent with existing convention)
- Change default plugin set to 'none' only (avoid double-run footgun)
- Persist prompt_suffix in TrialResults and TSV output for traceability
- Replace hasattr(_log_formatter) duck-typing with BaseAgent interface
  methods (format_agent_log, extract_tools_used already have no-op defaults)
- Standardize transcript output to single sessions/transcript.html file
  instead of agent-specific directory structure
- Update CLAUDE.md and README.md: replace --use-mcp with --plugin-set
- Add unit tests for _configure_mcp_servers (8 test cases)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@joellabes joellabes marked this pull request as ready for review February 9, 2026 03:23
@joellabes
Collaborator Author

@bstancil Do you want to give this a gander? Finally ready to rumble

@clkao
Contributor

clkao commented Feb 10, 2026

I think we should include the plugin-set in the container image names, so runs with different sets won't conflict.

Strong agree! Discussed further in #96 (comment) but I will approach that in a different PR because this one is already a monster.

created #118

@joellabes joellabes changed the title Experimenting with Claude skills Plugin system (groups of skills, MCP servers) Feb 12, 2026
joellabes and others added 5 commits February 23, 2026 13:35
Resolve conflicts from black formatting and ruff lint changes on main,
keeping plugin set system from this branch (replacing use_mcp boolean).
Also incorporate --log-level CLI option added on main.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@bstancil
Collaborator

Very in favor of this: MCP was the hotness when this got built, and now skills are definitely the new hotness instead. (And honestly, I'm not sure how much longer MCP will really matter: https://trends.google.com/explore?q=claude%2520skills%2Cclaude%2520mcp&date=today%205-y&geo=Worldwide)

Also, on these things, two more general points:

  1. Y'all should definitely feel free to move things forward without me. Happy to look if I need to, but feel free to own this, and be the ultimate deciders.
  2. Along those lines, if there are broader rewrites of things you (or Claude) want to do, by all means. Most of this code was written 6 months ago with like, Cursor, line-by-line reviews, and Opus 4 or something. So it's entirely possible that Claude Code + Opus 4.6 will want to do things in very different ways, and if it does, who am I to disagree?
