
refactor(bugfix): convert commands to skills, add CLAUDE.md, clean up#42

Open
jwm4 wants to merge 28 commits into ambient-code:main from jwm4:feature/bugfix-workflow-skills-refactor

Conversation

@jwm4 (Contributor) commented Feb 10, 2026

Bugfix workflow: skill-based architecture with workflow controller

Problem

The bugfix workflow had all process logic inline in command files. This meant users only got the workflow's carefully designed process if they typed the exact slash command — saying "fix this bug" in natural language would bypass all of it. Phase transitions were unreliable, the agent would auto-advance without pausing, and the PR creation process failed in predictable ways that weren't handled.

Changes

Architecture: commands → skills → controller

Refactored from a command-heavy design to a three-layer architecture:

  • Controller skill (.claude/skills/controller/SKILL.md) — owns all phase transitions, recommendations, and flow logic. The single source of truth for "what happens next."
  • Phase skills (.claude/skills/{name}/SKILL.md) — contain the full multi-step process for each phase. Report findings only; never recommend next steps.
  • Command wrappers (.claude/commands/*.md) — thin pointers (~5 lines each) that route to the controller (sketch below).

This means users get the workflow's benefit whether they type /fix or "fix this bug."
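
A minimal sketch of one wrapper (hypothetical wording, pieced together from the commit notes below; `$ARGUMENTS` is Claude Code's placeholder for command arguments):

```markdown
<!-- .claude/commands/fix.md -->
Read the file .claude/skills/controller/SKILL.md now and follow it step by step.
The user invoked /fix. Arguments: $ARGUMENTS
```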

New phases:

  • /assess — initial phase that reads the bug report, summarizes understanding, identifies gaps, and proposes a plan before jumping into reproduction
  • /review — optional phase after /test that critically evaluates the fix and tests, looking for gaps and missed edge cases
  • /pr — systematic, failure-resistant pull request creation with fork workflows, bot identity handling, and a 4-rung fallback ladder

New files:

  • workflows/bugfix/CLAUDE.md — behavioral guidelines (principles, hard limits, safety, quality, escalation)
  • docs/adr/2026-02-12-bugfix-lessons-learned.md — 16 lessons learned about prompt engineering for multi-phase AI workflows

Deleted files:

  • .claude/agents/amber.md — redundant with systemPrompt; useful content moved to CLAUDE.md
  • AMBER_RESEARCH.md, FIELD_REFERENCE.md, TEMPLATE_FEEDBACK.md — supplementary docs not used by the workflow
  • .ambient/ambient.clean.json — unused duplicate

Simplified systemPrompt:

Reduced from ~100 lines to ~10 lines. The systemPrompt now just identifies the agent, points at the controller, and lists the workspace layout. All behavioral rules live in CLAUDE.md; all process logic lives in skills.
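
Purely as an illustration (not the literal prompt; the workspace paths here are invented), the reduced systemPrompt might read:

```text
You are Amber, a bug-fixing agent.
CRITICAL: use the Read tool to open .claude/skills/controller/SKILL.md
and follow it for phase selection and transitions.
Workspace layout:
- repo/      -- clone of the target repository
- artifacts/ -- phase outputs (e.g., verification.md)
```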

Key design decisions

  • Skills trigger controller re-reads. Each skill ends with "Then re-read the controller for next-step guidance." This is the mechanism that prevents the agent from getting stuck in the previous skill's context after a phase completes.
  • Controller owns all recommendations. Skills report findings; the controller decides what to suggest next. This prevents skills from recommending skipping phases (e.g., fix → PR, skipping test and review).
  • Flexible recommendations over rigid routing. The controller has guidance for skipping forward, going back, and ending early — not a fixed phase-to-phase table.
  • PR skill handles bot identity. When running as a GitHub App, the skill resolves the real user via /installation/repositories, uses the correct jq fields for fork detection, and treats a 403 from gh pr create as an expected outcome (providing a GitHub compare URL) rather than an error to debug (sketched below).
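
To make the first decision concrete, every phase skill now closes with the same hand-off (a sketch; the exact SKILL.md wording may differ slightly):

```markdown
## When This Phase Is Done

Summarize your findings to the user. Do not recommend a next step.
Then re-read the controller for next-step guidance.
```

And a rough sketch of the PR skill's bot-identity plumbing, assembled from the commits below (the placeholder names come from the skill's placeholders table; treat the exact commands as assumptions rather than the skill's literal contents):

```bash
# Running as a GitHub App: resolve the real user behind the installation.
GH_USER=$(gh api /installation/repositories --jq '.repositories[0].owner.login')

# Fork detection: the parent object exposes owner.login and name
# (.parent.nameWithOwner does not exist).
UPSTREAM_OWNER=$(gh repo view "$FORK_OWNER/$REPO" --json parent --jq '.parent.owner.login')
UPSTREAM_REPO=$(gh repo view "$FORK_OWNER/$REPO" --json parent --jq '.parent.name')

# A 403 from `gh pr create` is expected for a bot with no upstream access;
# fall back to a compare URL the user can open themselves.
echo "https://github.com/${UPSTREAM_OWNER}/${UPSTREAM_REPO}/compare/main...${FORK_OWNER}:${BRANCH_NAME}?expand=1"
```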

Testing

Iteratively tested on the Ambient Code Platform across multiple sessions. The ADR documents all issues encountered and how they were resolved.

jwm4 added 28 commits February 10, 2026 15:36
- Convert all 5 commands (reproduce, diagnose, fix, test, document) into
  detailed skills in .claude/skills/{name}/SKILL.md
- Rewrite commands as thin wrappers that delegate to skills, passing
  arguments and session context
- Add CLAUDE.md with behavioral guidelines extracted from amber.md
  (engineering discipline, safety guardrails, quality standards)
- Remove .claude/agents/amber.md (redundant with systemPrompt persona)
- Remove supplementary files (AMBER_RESEARCH.md, FIELD_REFERENCE.md,
  TEMPLATE_FEEDBACK.md)
- Update systemPrompt to reference skills alongside commands
- Update README to reflect new structure
- Add .claude/skills/pr/SKILL.md with systematic fork-based PR workflow
- Add .claude/commands/pr.md thin wrapper
- Handles pre-flight checks, fork setup, branch/commit/push, and PR creation
- Includes error recovery for common failures (no push access, sandbox
  restrictions, missing fork)
- Update systemPrompt and startupPrompt to include /pr as phase 6
- Update README with PR phase documentation
Make CLAUDE.md self-contained by including workflow phases, commands,
skill locations, and artifact path directly instead of referencing
ambient.json (which is platform plumbing, not model-visible).

Also fix markdown lint warnings (emphasis-as-heading, code block language).
- Add .claude/skills/review/SKILL.md — critically evaluates fix and tests,
  issues verdict (inadequate fix / incomplete tests / solid), recommends
  next steps
- Add .claude/commands/review.md thin wrapper
- Update /test skill to report verification.md path and summarize results
  inline to the user
- Update systemPrompt, startupPrompt, CLAUDE.md, README with /review as
  optional phase 5 between test and document
- systemPrompt: add PHASE TRANSITIONS section requiring agent to stop
  and summarize at end of each phase instead of auto-advancing
- CLAUDE.md: add Flow Control section reinforcing pause behavior,
  trim to ~70 lines per best practices
- pr/SKILL.md: add placeholders table defining GH_USER, FORK_OWNER,
  UPSTREAM_OWNER/REPO, REPO, and BRANCH_NAME with sources
- pr/SKILL.md: rewrite Step 2 to ask user before creating fork,
  wait for confirmation, never silently skip ahead
- pr/SKILL.md: replace flat fallback with 4-rung fallback ladder,
  patch file is now absolute last resort
- ambient.json: remove rubric (deferred to future PR),
  remove startupPrompt and results fields
- Delete ambient.clean.json (no longer needed)
- AGENTS.md: add sandbox restrictions table
- All command wrappers now say 'Read the file ... now and follow it step
  by step' instead of 'Using the X skill from ...' which the agent was
  interpreting as conceptual guidance rather than a file-read instruction
- systemPrompt: add CRITICAL prefix to skill-reading instruction, explicitly
  say 'use the Read tool to open the referenced SKILL.md file'
- pr/SKILL.md: add 'IMPORTANT: Follow This Skill Exactly' section at top
  with anti-improvisation rules
- pr/SKILL.md: add pre-flight summary checkpoint between Steps 1 and 2
- pr/SKILL.md: upgrade 'no fork' path from 'STOP' to 'HARD STOP' with
  explicit instructions to wait for user twice (before fork attempt,
  after fork attempt fails)
- pr/SKILL.md: add fork verification command before proceeding to Step 3
- Add critical rules: never attempt gh repo fork without asking user first,
  never fall back to patch files without exhausting all other options
- systemPrompt: add step to tell user which skill is being invoked
  before reading it, so user can correct if wrong phase was picked
- CLAUDE.md: reinforce skill announcement policy
- Step 1a: don't dead-end on missing gh auth — continue pre-flight to
  gather info from git, then present options
- Step 1b: add fallback git config when gh is unavailable
- Step 1d: add git-remote-based fallback for upstream repo identification
- Add decision point after pre-flight: present 3 clear options to user
  (set up auth, provide fork URL, or prepare for manual push) and WAIT
  for response instead of dumping all manual instructions at once
Every skill now ends with a 'When This Phase Is Done' section that:
- Tells the agent to summarize findings to the user
- Recommends the natural next step in the workflow
- Explicitly says to stop and wait for the user

This reinforces the phase-transition pause from the systemPrompt and
CLAUDE.md at the skill level, so the agent gets the instruction to stop
regardless of whether it's following the systemPrompt guidance or
reading the skill directly.

Next-step recommendations follow the natural flow:
reproduce → diagnose → fix → test → review → document → pr
- New /assess phase (phase 1): reads the bug report, presents understanding,
  identifies gaps, proposes reproduction plan, and waits for user confirmation
  before taking any action. No code execution or cloning in this phase.
- systemPrompt: updated to 8-phase flow starting with ASSESS, added explicit
  instruction to start with assess when user provides a bug report
- CLAUDE.md: updated phase list to include assess
- pr/SKILL.md: clarified that 'never push to upstream' applies even to org
  bots and GitHub Apps — no exceptions based on account type
- pr/SKILL.md: added reminder at pre-flight summary not to skip Step 2
- systemPrompt: scope 'start with ASSESS' to initial bug report only
- systemPrompt: add 'INTERPRETING USER RESPONSES' section so agent
  maps phase names and confirmations to the correct next phase
- assess skill: allow cloning and reading code (but not executing)
- assess skill: add Step 2 to check for repo and clone if missing,
  read referenced PRs/files to inform assessment
- assess skill: update output to note repo is cloned for later phases
- Add awareness of remaining phases (review, document, pr) so agent
  recommends the full path instead of jumping to PR
… via Task tool

Instead of the orchestrator reading and executing skill files directly
(which caused context bleed and skill-selection failures), Amber now
dispatches each phase to a subagent using the Task tool.

Changes:
- systemPrompt declares Amber as an orchestrator with dispatch protocol
- Command wrappers instruct dispatch via Task tool (not direct execution)
- Skill 'When This Phase Is Done' sections return results (not user-facing)
- CLAUDE.md reinforces orchestrator role and subagent dispatch pattern
Replace systemPrompt-based orchestration with a dedicated controller skill
(.claude/skills/controller/SKILL.md) that the model reads and follows.

The controller has:
- Phase table mapping commands to skill paths
- Explicit dispatch protocol via Task tool
- User response interpretation table (yes → last recommendation, not repeat)
- Critical rule: never re-dispatch the phase that just finished

systemPrompt is now minimal: identity + pointer to controller.
Command wrappers all route through the controller.
CLAUDE.md references the controller instead of restating flow rules.
…subagents

Switch from Task tool dispatch to direct execution so users see full
progress in the platform UI. The controller re-reads itself after each
phase to keep transition rules fresh and prevent getting stuck.

This preserves the correct transition behavior while restoring visibility.
Move all next-step logic out of individual skills and into the controller.
Skills now only report findings; the controller decides what to recommend.

Recommendations are flexible: the controller considers skipping forward,
going back, or ending early based on what actually happened, and presents
multiple options instead of a single rigid next step.
… descriptions

- Remove next-step recommendations from all 8 skills (they just report findings)
- Controller owns all recommendation logic with flexible guidance
- Add one-sentence descriptions to each phase in the controller
- Remove response-interpretation table (model handles this naturally)
…ncements

- systemPrompt and CLAUDE.md: re-read controller when choosing/starting a phase
- Controller: explicit phase announcement with example before execution
- Avoids re-reading on every message (context budget concern)
- pr/SKILL.md: use .parent.owner.login and .parent.name instead of
  non-existent .parent.nameWithOwner in fork detection query
- ADR: add lesson 13 (test API responses against real data), lesson 5
  (controller re-read timing), lesson 6 (phase descriptions improve
  recommendations); update lesson 2 (drop response-interpretation table);
  renumber to 15 total lessons
…ontroller

- pr/SKILL.md: recognize bot 403 as expected, use GitHub compare URL for
  manual PR creation instead of bare fork URL, prevent patch-file spiral
- review/SKILL.md: remove A/B/C verdict labels (meaningless to users)
- controller/SKILL.md: add 'continue to next step' as default recommendation,
  clarify recommendation timing
- ADR: add lesson 14 (GitHub App on user ≠ permission on upstream), now 16 total
- All 8 skills now end with 'Then re-read the controller for next-step guidance'
- Removed redundant re-read instructions from controller, systemPrompt, CLAUDE.md
- Controller Step 4 now acknowledges the skill-triggered re-read instead of duplicating it
- Net effect: one place triggers the re-read (end of skill), one place handles it (controller)
@jwm4 jwm4 marked this pull request as ready for review February 14, 2026 18:58
@jeremyeder (Contributor) commented

This is solid, I'll load it as a custom workflow and see how we do. Thanks!

@jwm4 (Contributor, Author) commented Feb 15, 2026

Oh, I forgot to mention this in the PR description. Here are three things that I think should be in this workflow but are not in this PR because I wanted to break up the work a little:

  1. YOLO mode, i.e., some way to tell the model to just run the entire workflow without stopping to ask questions. I don't think this is hard to do, but it will take some iterating to work well: the current workflow has pretty strong guidance not to YOLO, so getting it to YOLO only when it's supposed to (and making sure it doesn't forget whether it's in YOLO mode) might be a little tricky. I have some thoughts on how to do it, but I want to take the time to get it right.
  2. Context7 and/or DeepWiki MCP servers: since these are free hosted servers that don't require any credentials or setup, it should be easy to include them in the workflow. I haven't tried it, but I think it would work. I would be inclined to start with just Context7, both because I've gotten a lot of good results from it and because it's usually better to limit how many options you give a model. The workflow is fairly complex already, so if we're going to add extra tools I would want to do so sparingly. (A configuration sketch follows this list.)
  3. The Ambient-specific rubric mechanism: right now the workflow has a skill for assessing the quality of a fix. That skill could probably be complemented by a formal rubric schema in ambient.json and a corresponding rubric.md, but I don't have good intuition about what guidance belongs in the skill versus the rubric.md versus what would be implicit in the schema. I also don't really understand what's being done with either the schema or the rubric.md, so more work would be needed to figure out how to make the skill and the rubric mechanism work well together. I'm also a little concerned that if we limit the scope of the skill because the rubric mechanism covers it well, the workflow becomes less portable: you couldn't just drop it into Claude Code, because the skill would be designed to complement the Ambient-specific rubric mechanism rather than stand alone. So I'm a little on the fence about whether this is even a good idea.
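
For (2), assuming the platform honors a Claude-style .mcp.json and that Context7's hosted endpoint is still https://mcp.context7.com/mcp (both assumptions; I haven't tested this), the wiring might be as small as:

```json
{
  "mcpServers": {
    "context7": {
      "type": "http",
      "url": "https://mcp.context7.com/mcp"
    }
  }
}
```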

@kami619 commented Feb 19, 2026

@jeremyeder and @jwm4: The workflow itself worked perfectly well. I used it to produce ambient-code/agentready#306, which addresses ambient-code/agentready#302, running the updated bugfix workflow as a Custom Workflow based on Jeremy's suggestion.

Here is a summary of the session and where it could be better.

The agent (Amber) successfully navigated a structured, 8-phase bugfix workflow (/assess through /pr) to resolve GitHub Issue #302 in the agentready repository. The bug involved the application ignoring the excluded_attributes field in user configuration files. The agent diagnosed the root cause, implemented the fix, wrote regression tests, generated documentation, and successfully iterated through three rounds of user/CI feedback to polish the pull request.

Workflow Breakdown

Assess & Reproduce: The agent analyzed the issue and codebase, identifying that Config.is_excluded() was never actually called. It set up a Python 3.12 virtual environment and successfully reproduced the bug.

Diagnose: Using git log and git blame, the agent traced the history, discovering that a CLI --exclude flag was added in a later commit without being integrated with the existing configuration file logic.

Fix & Test: The agent modified cli/main.py to merge CLI and config exclusions into a single set before filtering. It then added three new regression tests to test_main.py using Python's unittest.mock.

Document & PR: The agent automatically generated comprehensive artifacts (release notes, changelog, team announcement, and PR description) and pushed the branch to the user's fork, providing a direct link to open the PR.

Iterative Refinement:

  • Iteration 1: Fixed black formatting issues in the test docstrings pointed out by the user.

  • Iteration 2: Addressed a CI failure (an E2E test broke because the fix successfully started validating config files, catching an invalid mock attribute repomix_config in an old test). Also updated error handling to differentiate CLI errors from config errors based on an automated PR review.

  • Iteration 3: Fixed fragile mock introspection (call_args[0][0] to call_args.args[0]), consolidated error messages, and swapped capsys for CliRunner based on a secondary review.

Areas for Improvement

While the agent ultimately produced a high-quality fix, there were several inefficiencies and oversights during the session:

  • Missed Full Test Suite Run Post-Fix: During the /test phase, the agent ran the specific CLI unit tests and verified they passed, but it failed to run the E2E test suite (tests/e2e/test_critical_paths.py). Had it run the full suite locally, it would have immediately caught the broken test_assess_with_valid_config test before pushing to CI.

  • Incorrect Formatting Sequence: The agent responsibly ran linters (black, isort, ruff) during the /fix phase on main.py. However, it forgot to re-run black after adding the regression tests in the /test phase. This resulted in the user having to manually point out formatting violations via screenshots.

  • Clunky Environment Setup: During the /reproduce phase, the agent took 6-7 shell attempts to get the environment working. It stumbled over Python version mismatches (3.11 vs. 3.12 required), uv command syntax, and missing pip executables before finally getting the virtual environment established. A more optimized agent would check pyproject.toml constraints first and execute a single, correct bootstrap command.

  • Initial Blind Spot on UX/Error Handling: The agent's initial fix blindly threw a click.BadParameter error for both CLI and config exclusions. It took an external automated review to point out that throwing a CLI-specific error for a config file issue is a poor user experience. The agent should have proactively recognized the context difference and used click.ClickException from the start.

  • Use of Legacy/Fragile Testing Patterns: The agent defaulted to older, more fragile testing patterns — specifically tuple indexing for mock arguments (call_args[0][0]) and capsys for output capture instead of Click's native CliRunner, which was already present in the test suite fixtures (see the sketch below).
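
For the last two bullets, the fragile and preferred patterns side by side (a runnable sketch; the `cli` command here is a stand-in for agentready's real entry point):

```python
import click
from unittest import mock
from click.testing import CliRunner

@click.command()
@click.option("--exclude", multiple=True)
def cli(exclude):
    # Stand-in for agentready's actual Click entry point.
    click.echo(f"excluded: {', '.join(exclude)}")

def test_mock_args():
    m = mock.Mock()
    m("config.yaml", strict=True)
    # Fragile: positional tuple indexing into call_args.
    assert m.call_args[0][0] == "config.yaml"
    # Preferred: named accessors, available since Python 3.8.
    assert m.call_args.args[0] == "config.yaml"
    assert m.call_args.kwargs["strict"] is True

def test_cli_output():
    # Preferred over capsys: Click's CliRunner captures output and exit code.
    runner = CliRunner()
    result = runner.invoke(cli, ["--exclude", "docs/"])
    assert result.exit_code == 0
    assert "docs/" in result.output
```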

@jeremyeder I used the export chat feature from the ACP GUI, like you showed, and put the transcript into another LLM. Voila: now we have actionable improvements to plug back in. I can clearly see the path to automated self-learning just by pointing the agent in an active session back to this summary and having it learn from it.

