
feat(skill): optimize code exec prompt and add e2e benchmark#1056

Open
fang-tech wants to merge 6 commits into agentscope-ai:main from fang-tech:feat/skill-code-exec-prompt-e2e-benchmark


@fang-tech
Contributor

AgentScope-Java Version

1.0.11-SNAPSHOT

Description

fix #1050, fix #1039, fix #952

  • Code execution prompt injection: AgentSkillPromptProvider now appends a
    ## Code Execution section to the skill system prompt when code execution is enabled,
    instructing the LLM to use execute_shell_command with absolute script paths rather than
    describing or rewriting commands inline. The prompt is wired automatically via SkillBox
    on enable() and after uploadSkillFiles().

  • codeExecutionInstruction() builder option: SkillBox.CodeExecutionBuilder exposes a
    new .codeExecutionInstruction(template) method for customizing the injected prompt section.
    %s placeholders are substituted with the uploadDir absolute path (up to 5 times).

  • Remove deprecated SkillBox() no-arg constructor and JarSkillRepositoryAdapter class
    (superseded by ClasspathSkillRepository).

  • E2E benchmark suite (SkillE2ETest): 5 parameterized tests across 4 recall dimensions
    using purpose-built benchmark skills. Tests record pass/fail without individual assertions;
    @AfterAll asserts overall recall rate ≥ 75% and prints a summary table. Benchmark skills:

    • data-transform / data-report: semantic discrimination
    • image-resize: distractor resistance
    • log-parser: fuzzy trigger
    • git-changelog: code execution recall

  • Docs (EN + ZH): add configuration reference for code execution builder options and a new
    "Custom Skill Prompts" section documenting instruction, template, and
    codeExecutionInstruction.
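
The %s substitution described for codeExecutionInstruction(template) can be illustrated with a minimal sketch. This is not the code from this PR: the class name, the render helper, and the template wording are hypothetical, and the script path under the upload directory is assumed for illustration only. The sketch assumes the "up to 5 times" limit is implemented by passing the absolute path five times to String.format, which ignores surplus arguments.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical stand-in, not the AgentScope implementation.
final class CodeExecTemplateSketch {
    // Cap taken from the PR description ("up to 5 times").
    static final int MAX_PLACEHOLDERS = 5;

    /** Fills each %s in the template with the absolute upload directory. */
    static String render(String template, Path uploadDir) {
        String abs = uploadDir.toAbsolutePath().normalize().toString();
        Object[] args = new Object[MAX_PLACEHOLDERS];
        Arrays.fill(args, abs);
        // String.format ignores unused arguments, so templates with fewer than
        // five placeholders work; a sixth %s throws MissingFormatArgumentException.
        return String.format(template, args);
    }

    public static void main(String[] args) {
        // Assumed template text and script layout, for illustration only.
        String template = "## Code Execution%n"
                + "Skill files are deployed under %s.%n"
                + "Run scripts with execute_shell_command, for example:%n"
                + "python3 %s/git-changelog/scripts/generate_changelog.py";
        System.out.println(render(template, Paths.get("/tmp/skills")));
    }
}
```

Because the template goes through a format call in this sketch, any literal percent sign in a custom template would need to be written as %%.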

Checklist

Please check the following items before code is ready to be reviewed.

  • Code has been formatted with mvn spotless:apply
  • All tests are passing (mvn test)
  • Javadoc comments are complete and follow project conventions
  • Related documentation has been updated (e.g. links, examples, etc.)
  • Code is ready for review

fang-tech and others added 2 commits March 28, 2026 19:45
…truction option

- AgentSkillPromptProvider: add DEFAULT_CODE_EXECUTION_INSTRUCTION template and
  conditional append logic when code execution is enabled with an upload dir
- AgentSkillPromptProvider: expose setCodeExecutionEnable/setUploadDir/setCodeExecutionInstruction
- SkillBox.CodeExecutionBuilder: add codeExecutionInstruction() builder method
- SkillBox: wire prompt provider on enable() and after uploadSkillFiles()
- SkillBox: remove deprecated no-arg constructor

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
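
The conditional append described in this commit (add the section only when code execution is enabled and an upload dir is set) might look roughly like the following. This is a sketch under stated assumptions: the class is a hypothetical stand-in for AgentSkillPromptProvider, only the three setter names come from the commit message, the default instruction text is a placeholder, and placeholder handling is simplified to a plain string replace.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical stand-in; only the setter names are taken from the commit message.
final class SkillPromptSketch {
    // Placeholder text: the real DEFAULT_CODE_EXECUTION_INSTRUCTION is not shown on this page.
    static final String DEFAULT_INSTRUCTION =
            "## Code Execution\nRun the deployed scripts under %s with execute_shell_command.";

    private boolean codeExecutionEnabled;
    private Path uploadDir;
    private String customInstruction;

    void setCodeExecutionEnable(boolean enabled) { this.codeExecutionEnabled = enabled; }
    void setUploadDir(Path dir) { this.uploadDir = dir; }
    void setCodeExecutionInstruction(String template) { this.customInstruction = template; }

    /** Returns the base prompt, plus the code execution section when both conditions hold. */
    String build(String basePrompt) {
        if (!codeExecutionEnabled || uploadDir == null) {
            return basePrompt; // nothing to inject
        }
        String template = customInstruction != null ? customInstruction : DEFAULT_INSTRUCTION;
        String dir = uploadDir.toAbsolutePath().normalize().toString();
        // Simplification: replaces every %s, with no cap on the number of placeholders.
        return basePrompt + "\n\n" + template.replace("%s", dir);
    }

    public static void main(String[] args) {
        SkillPromptSketch provider = new SkillPromptSketch();
        provider.setCodeExecutionEnable(true);
        provider.setUploadDir(Paths.get("/tmp/skills"));
        System.out.println(provider.build("You have access to the following skills."));
    }
}
```
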
…ecall dimensions

Skills under e2e-skills/ are designed specifically for benchmarking:
- data-transform vs data-report: semantic discrimination (similar domain, different intent)
- image-resize: distractor resistance (must not fire for non-image tasks)
- log-parser: fuzzy trigger (indirect problem description should still match)
- git-changelog: code execution recall (LLM must reference deployed script path)

SkillE2ETest collects pass/fail per scenario without asserting individually;
@AfterAll asserts overall recall rate >= 75% and prints a summary table.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
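
The collect-then-gate pattern described above (record pass/fail per scenario, assert once on the aggregate) can be sketched without JUnit as follows. The class and method names are hypothetical, and the outcomes recorded in main are made-up illustration values, not results from this benchmark. In the actual suite the final check would run in the @AfterAll hook.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical aggregator; SkillE2ETest itself is not shown on this page.
final class RecallGateSketch {
    static final double THRESHOLD = 0.75; // ">= 75%" from the commit message

    private final Map<String, Boolean> results = new LinkedHashMap<>();

    /** Records one scenario outcome without failing the run. */
    void record(String scenario, boolean passed) {
        results.put(scenario, passed);
    }

    /** Fraction of recorded scenarios that passed; 0.0 when nothing was recorded. */
    double recallRate() {
        if (results.isEmpty()) {
            return 0.0;
        }
        long passed = results.values().stream().filter(Boolean::booleanValue).count();
        return (double) passed / results.size();
    }

    /** One line per scenario followed by the aggregate rate. */
    String summary() {
        StringBuilder sb = new StringBuilder();
        results.forEach((name, ok) ->
                sb.append(String.format("%-16s %s%n", name, ok ? "PASS" : "FAIL")));
        sb.append(String.format("recall = %.2f (threshold %.2f)%n", recallRate(), THRESHOLD));
        return sb.toString();
    }

    /** The single assertion, applied to the aggregate only. */
    void assertThreshold() {
        if (recallRate() < THRESHOLD) {
            throw new AssertionError("recall " + recallRate() + " is below " + THRESHOLD);
        }
    }

    public static void main(String[] args) {
        RecallGateSketch gate = new RecallGateSketch();
        // Made-up outcomes for illustration only.
        gate.record("data-transform", true);
        gate.record("data-report", true);
        gate.record("image-resize", true);
        gate.record("log-parser", false);
        gate.record("git-changelog", true);
        System.out.print(gate.summary());
        gate.assertThreshold(); // 4 of 5 scenarios passed, so the gate holds
    }
}
```

Gating on the aggregate lets a single flaky LLM response show up in the summary without failing the whole run, at the cost of hiding which scenario regressed unless the summary is read.
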
Contributor

Copilot AI left a comment


Pull request overview

This PR enhances the Skill system prompt to better trigger actual code execution (via execute_shell_command with absolute paths), adds an E2E benchmark suite for skill recall/code-exec recall, removes deprecated skill-loading utilities, and updates EN/ZH docs accordingly.

Changes:

  • Append an optional “Code Execution” section to the skill system prompt when code execution is enabled, with a new codeExecutionInstruction(...) customization hook.
  • Add SkillE2ETest benchmark suite + benchmark skills under src/test/resources/e2e-skills/.
  • Remove deprecated SkillBox() no-arg constructor and deprecated JarSkillRepositoryAdapter; update documentation.

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| docs/zh/task/agent-skill.md | Updates ZH documentation for code execution builder options and custom skill prompts. |
| docs/en/task/agent-skill.md | Updates EN documentation for code execution builder options and custom skill prompts. |
| docs/_toc.yml | Adjusts docs navigation entry for the ZH “agent-skill” page. |
| agentscope-core/src/test/resources/e2e-skills/log-parser/scripts/extract_errors.py | Adds benchmark script for log parsing/code execution recall scenarios. |
| agentscope-core/src/test/resources/e2e-skills/log-parser/SKILL.md | Adds benchmark skill metadata/doc for log-parser. |
| agentscope-core/src/test/resources/e2e-skills/image-resize/scripts/resize.py | Adds benchmark script for distractor resistance scenarios. |
| agentscope-core/src/test/resources/e2e-skills/image-resize/SKILL.md | Adds benchmark skill metadata/doc for image-resize. |
| agentscope-core/src/test/resources/e2e-skills/git-changelog/scripts/generate_changelog.py | Adds benchmark script for code execution recall scenarios. |
| agentscope-core/src/test/resources/e2e-skills/git-changelog/SKILL.md | Adds benchmark skill metadata/doc for git-changelog. |
| agentscope-core/src/test/resources/e2e-skills/data-transform/scripts/json_to_csv.py | Adds benchmark script for semantic discrimination scenarios. |
| agentscope-core/src/test/resources/e2e-skills/data-transform/scripts/csv_to_json.py | Adds benchmark script for semantic discrimination scenarios. |
| agentscope-core/src/test/resources/e2e-skills/data-transform/SKILL.md | Adds benchmark skill metadata/doc for data-transform. |
| agentscope-core/src/test/resources/e2e-skills/data-report/scripts/summarize.py | Adds benchmark script for semantic discrimination scenarios. |
| agentscope-core/src/test/resources/e2e-skills/data-report/SKILL.md | Adds benchmark skill metadata/doc for data-report. |
| agentscope-core/src/test/java/io/agentscope/core/skill/SkillBoxTest.java | Adds integration tests ensuring prompt injection behavior when code execution is enabled. |
| agentscope-core/src/test/java/io/agentscope/core/skill/AgentSkillPromptProviderTest.java | Adds unit tests for code execution prompt section behavior and customization. |
| agentscope-core/src/test/java/io/agentscope/core/e2e/SkillE2ETest.java | Adds concurrent-gated E2E benchmark tests for skill recall + code execution recall. |
| agentscope-core/src/main/java/io/agentscope/core/skill/util/JarSkillRepositoryAdapter.java | Removes deprecated adapter (superseded by ClasspathSkillRepository). |
| agentscope-core/src/main/java/io/agentscope/core/skill/SkillBox.java | Removes deprecated no-arg constructor; wires code-exec prompt settings from builder. |
| agentscope-core/src/main/java/io/agentscope/core/skill/AgentSkillPromptProvider.java | Implements conditional code execution section injection + customization plumbing. |

fang-tech and others added 4 commits March 28, 2026 20:18
…chmark scripts

- Fix AgentSkillPromptProviderTest: update assertion from "existing scripts"
  to "existing Python script" to match actual prompt template text
- Add Apache 2.0 license headers to all 6 e2e benchmark Python scripts

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…chmark scripts

- Fix AgentSkillPromptProviderTest: update assertion from "existing scripts"
  to "existing Python script" to match actual prompt template text
- Add Apache 2.0 license headers to all 6 e2e benchmark Python scripts

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@codecov

codecov bot commented Mar 28, 2026

Codecov Report

❌ Patch coverage is 85.71429% with 3 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...gentscope/core/skill/AgentSkillPromptProvider.java | 85.71% | 0 Missing and 2 partials ⚠️ |
| ...c/main/java/io/agentscope/core/skill/SkillBox.java | 85.71% | 0 Missing and 1 partial ⚠️ |



2 participants