diff --git a/README.md b/README.md index 3c112a0..deca962 100644 --- a/README.md +++ b/README.md @@ -8,6 +8,8 @@ Works with **Claude Code**, **Codex**, and any agent supporting the Agent Skills Most skill auditors only do static checks on your SKILL.md. This one also mines your actual session transcripts to measure trigger rates, user satisfaction, workflow completion, and undertrigger gaps — then scores each skill on a 5-point composite scale. +It also works in a **source-repository audit mode** for public skill collections that are still building usage evidence. In that mode it uses validators, example prompts, validation logs, CI, and maintainer sessions as secondary evidence, while clearly marking trigger-rate and reaction findings as low-confidence or `N/A` when live routing data is missing. + ## What It Does **6 scored dimensions** (weighted into composite score): @@ -29,6 +31,24 @@ Most skill auditors only do static checks on your SKILL.md. This one also mines | **Cross-Skill Conflicts** | Trigger keyword overlap and contradictory guidance between skills | | **Environment Consistency** | Broken file paths, missing CLI tools, non-existent directories | +## Audit Modes + +### Installed Skill Mode + +Use this when the skills are already installed under `~/.claude/skills/`, `~/.codex/skills/`, or `~/.agents/skills/`. This is the highest-confidence mode because invocation evidence can come directly from session transcripts. + +### Source Repository Mode + +Use this when you are auditing a skill repository before or alongside publication. The optimizer can still run all 8 dimensions, but it treats repo validators, example prompts, validation logs, review checklists, and CI as fallback evidence rather than pretending they are the same as live routing telemetry. + +## When Routing Eval Is Necessary + +Routing-eval or transcript evidence is not equally urgent for every repository. 
+ +- Treat it as `P0` when the repository claims the skills are already routing-proven, production-ready, or validated in real agent use. +- Treat it as `P1` or next-milestone evidence work when the repository is honestly positioned as docs-first, draft, or beta and already separates proven behavior from future validation goals. +- Treat it as optional only when the repository is not making routing claims at all and the audit goal is purely static cleanup. + ## Installation Copy the command below and paste it directly into your agent's chat — it will install automatically: @@ -71,6 +91,15 @@ cp -r /tmp/skill-optimizer/skills/skill-optimizer ~/.agents/skills/ rm -rf /tmp/skill-optimizer ``` +```powershell +# Windows PowerShell example for Codex +$target = Join-Path $env:TEMP 'skill-optimizer' +git clone https://github.com/hqhq1025/skill-optimizer.git $target +New-Item -ItemType Directory -Force -Path "$HOME\.codex\skills" | Out-Null +Copy-Item -Recurse -Force "$target\skills\skill-optimizer" "$HOME\.codex\skills\" +Remove-Item -Recurse -Force $target +``` + ## Usage @@ -98,6 +127,15 @@ The optimizer auto-detects available platforms and scans session data from all o | Codex | `~/.codex/skills/` | `~/.codex/sessions/**/*.jsonl` | | Shared | `~/.agents/skills/` | — | +For Codex, skill loading in `base_instructions` is not enough to prove actual use. The optimizer looks for workflow markers or explicit prompt/result evidence before counting an invocation.
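+This loading-versus-invocation check can be sketched in Python. The JSONL event names (`session_meta`, `response_item`) follow the table above, but the exact field shapes and the sample marker strings are illustrative assumptions, not the optimizer's actual implementation:

```python
import json
from pathlib import Path

def codex_invocation_evidence(session_path: Path, skill_name: str, markers: list[str]) -> dict:
    """Classify one Codex session JSONL file. Presence in base_instructions
    is recorded separately from invocation; only a workflow marker inside a
    response_item counts as invocation evidence."""
    loaded = invoked = False
    for line in session_path.read_text(encoding="utf-8").splitlines():
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate truncated or malformed lines
        if not isinstance(event, dict):
            continue
        blob = json.dumps(event)  # flatten nested payloads for substring search
        if event.get("type") == "session_meta" and skill_name in blob:
            loaded = True   # context injection only, NOT proof of use
        elif event.get("type") == "response_item" and any(m in blob for m in markers):
            invoked = True  # output followed the skill's defined workflow
    return {"loaded": loaded, "invoked": invoked}
```

A session that reports `loaded` without `invoked` is exactly the case the optimizer refuses to count.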
+ +When auditing a source repository instead of an installed skill directory, the optimizer can also use: + +- repo-owned validators +- `references/` files and `agents/openai.yaml` +- example prompts and validation logs +- CI workflows and forward-test records + ## Research Background The analysis dimensions are grounded in peer-reviewed research: @@ -125,6 +163,58 @@ Compute composite scores (weighted average of 6 scored dimensions) Output report with P0/P1/P2 prioritized fixes ``` +When session data is sparse, the optimizer still runs all 8 dimensions and explicitly marks any unsupported metrics as `N/A` instead of fabricating a score. + +## Example Repository Audit + +Example: a docs-first miniapp skill repository contains 4 public skills, passes validators and CI, has example prompts plus forward-test notes, but only has maintainer sessions instead of clean installed-skill routing transcripts. + +The correct audit result is: + +- score static quality and progressive disclosure normally +- use validation logs as medium-confidence workflow evidence +- mark trigger rate, user reaction, and undertrigger as low-confidence or `N/A` +- report missing routing transcripts as the highest-priority `P1` for the next maturity step, not as a false `P0`, unless the repo is already claiming routing proof + +## Example Output + +```markdown +# Skill Optimization Report +**Date**: 2026-03-30 +**Scope**: all public skills in `miniprogram_skills` +**Evidence**: validator pass, CI, validation log, 8 maintainer sessions +**Confidence**: static=high, workflow=medium, routing=low +**Release stage**: docs-first public beta + +## Overview +| Skill | Trigger | Reaction | Completion | Static | Undertrigger | Token | Score | +|-------|---------|----------|------------|--------|--------------|-------|-------| +| miniapp-devtools-cli-repair | N/A | N/A | strong | strong | N/A | strong | 4/5 | +| miniapp-devtools-gui-check | N/A | N/A | strong | strong | N/A | strong | 4/5 | + +## P0 Fixes 
+None from the current evidence set. + +## P1 Improvements +1. Add one installed-skill transcript or replayable routing eval per public skill. +2. Add negative-path validation for adjacent skill boundaries. + +## Milestone Fit +- current-milestone blockers: none beyond already-declared beta limits +- next-milestone evidence work: transcript-backed routing proof + +## Per-Skill Diagnostics +### miniapp-devtools-gui-check +#### 4.1 Trigger Rate +N/A — insufficient live routing evidence +#### 4.3 Workflow Completion +Strong. Validation logs show one narrow host-side route check reaching a real report. +#### 4.6 Cross-Skill Conflicts +Moderate but controlled. Primary overlap is with miniapp-devtools-cli-repair. +``` + +This is the intended behavior for source-repository mode: keep the report honest, keep all 8 dimensions, and avoid overstating routing quality when the available evidence is mostly static or curated. + **Scored dimensions (weighted average):** - Trigger rate: 25% - User reaction: 20% diff --git a/README.zh-CN.md b/README.zh-CN.md index b4e6327..2d5f7b2 100644 --- a/README.zh-CN.md +++ b/README.zh-CN.md @@ -8,6 +8,8 @@ 大多数 skill 审计工具只做 SKILL.md 的静态检查。这个工具还会挖掘你的真实 session 记录,量化触发率、用户满意度、workflow 完成率和漏触发缺口,最终为每个 skill 打出 5 分制综合评分。 +它也支持一种 **source repository 审计模式**:当一个公开 skill 仓库还在积累真实使用证据时,可以退回到 validator、example prompts、validation log、CI 和维护者 session 这些次级证据,同时明确把触发率、用户反应之类缺少真实路由样本的维度标成低置信度或 `N/A`。 + ## 功能 **6 个评分维度**(加权计入综合分): @@ -29,6 +31,24 @@ | **跨 Skill 冲突** | 触发关键词重叠和 skill 间矛盾指导 | | **环境一致性** | 文件路径失效、CLI 工具缺失、目录不存在 | +## 审计模式 + +### 已安装 Skill 模式 + +当目标 skill 已经装在 `~/.claude/skills/`、`~/.codex/skills/` 或 `~/.agents/skills/` 下时,优先使用这个模式。它的置信度最高,因为可以直接从 session transcript 里寻找真实调用证据。 + +### Source Repository 模式 + +当你审计的是一个准备公开、还没积累足够 live routing 数据的 skill 仓库时,使用这个模式。优化器仍然会跑完整 8 维,但会把 repo validator、example prompts、validation log、review checklist 和 CI 当成 fallback evidence,而不会假装它们等同于真实路由遥测。 + +## 什么时候必须补 Routing Eval + +routing-eval / transcript 
证据并不是对每个仓库都同等紧急。 + +- 当仓库声称这些 skills 已经 routing-proven、production-ready,或者已经在真实 agent 使用里验证过时,这就是 `P0`。 +- 当仓库明确把自己定位成 docs-first、draft 或 beta,而且已经诚实地区分“已证明”和“待验证”时,这更适合作为 `P1` 或下一阶段证据工作。 +- 当审计目标只是做静态清理、仓库本身也没有做任何路由成熟度承诺时,这项工作可以先不排在最前面。 + ## 安装 复制下面的指令,直接粘贴到你的 agent 对话中即可自动安装: @@ -71,6 +91,15 @@ cp -r /tmp/skill-optimizer/skills/skill-optimizer ~/.agents/skills/ rm -rf /tmp/skill-optimizer ``` +```powershell +# Windows PowerShell 示例(Codex) +$target = Join-Path $env:TEMP 'skill-optimizer' +git clone https://github.com/hqhq1025/skill-optimizer.git $target +New-Item -ItemType Directory -Force -Path "$HOME\.codex\skills" | Out-Null +Copy-Item -Recurse -Force "$target\skills\skill-optimizer" "$HOME\.codex\skills\" +Remove-Item -Recurse -Force $target +``` + ## 使用 @@ -98,6 +127,15 @@ rm -rf /tmp/skill-optimizer | Codex | `~/.codex/skills/` | `~/.codex/sessions/**/*.jsonl` | | 共享 | `~/.agents/skills/` | — | +对于 Codex,`base_instructions` 里出现 skill 被加载,并不等于 skill 真的被调用。优化器会继续寻找 workflow marker 或明确的 prompt/result 证据,再把它计作一次 invocation。 + +如果审计的是 source repository,而不是已经安装到本地目录的 skill,优化器还可以读取: + +- 仓库自带 validator +- `references/` 文件和 `agents/openai.yaml` +- example prompts 与 validation log +- CI workflow 与 forward-test 记录 + ## 研究背景 分析维度基于同行评审的学术研究: @@ -125,6 +163,58 @@ rm -rf /tmp/skill-optimizer 输出 P0/P1/P2 优先级修复报告 ``` +当 session 数据不足时,优化器仍然会坚持跑完整 8 个维度,并把证据不足的指标明确标成 `N/A`,而不是伪造分数。 + +## 示例仓库审计 + +示例:一个 docs-first 的小程序 skill 仓库里有 4 个 public skills,validator 和 CI 都通过,也有 example prompts、validation log 和 forward-test 记录,但历史 session 主要还是维护者在建设仓库,而不是已经安装好的 skills 在真实对话里被稳定路由。 + +这时正确的审计结论应该是: + +- 静态质量和渐进式加载可以正常打分 +- workflow completion 可以把 validation log 当作中等置信度证据 +- 触发率、用户反应、漏触发要标成低置信度或 `N/A` +- 缺少 routing transcript 应该被列成“下一阶段最重要的 `P1`”,而不是机械地判成 `P0`;除非仓库已经对外宣称自己有真实路由证明 + +## 示例输出 + +```markdown +# Skill Optimization Report +**Date**: 2026-03-30 +**Scope**: `miniprogram_skills` 里的全部 public skills +**Evidence**: validator 通过、CI、validation log、8 条维护者 session
+**Confidence**: static=high, workflow=medium, routing=low +**Release stage**: docs-first public beta + +## Overview +| Skill | Trigger | Reaction | Completion | Static | Undertrigger | Token | Score | +|-------|---------|----------|------------|--------|--------------|-------|-------| +| miniapp-devtools-cli-repair | N/A | N/A | strong | strong | N/A | strong | 4/5 | +| miniapp-devtools-gui-check | N/A | N/A | strong | strong | N/A | strong | 4/5 | + +## P0 Fixes +当前证据集下无 P0。 + +## P1 Improvements +1. 给每个 public skill 补 1 条 installed-skill transcript 或可重放 routing eval。 +2. 给相邻 skill 边界补负路径验证。 + +## Milestone Fit +- current-milestone blockers: 除仓库已声明的 beta 限制外,无新增阻塞 +- next-milestone evidence work: transcript-backed routing proof + +## Per-Skill Diagnostics +### miniapp-devtools-gui-check +#### 4.1 Trigger Rate +N/A — 缺少足够的 live routing evidence +#### 4.3 Workflow Completion +Strong。validation log 证明过一次真实的窄路由宿主机检查并生成报告。 +#### 4.6 Cross-Skill Conflicts +Moderate but controlled。主要重叠对象是 miniapp-devtools-cli-repair。 +``` + +这就是 source-repository 模式下期望的输出风格:保留完整 8 维,诚实标注证据等级,不因为缺少 live routing 数据就伪造一个看起来很确定的结论。 + **评分维度(加权平均):** - 触发率:25% - 用户反应:20% diff --git a/skills/skill-optimizer/SKILL.md b/skills/skill-optimizer/SKILL.md index 974d4a7..17cc7f8 100644 --- a/skills/skill-optimizer/SKILL.md +++ b/skills/skill-optimizer/SKILL.md @@ -5,258 +5,39 @@ description: "Use when the user wants to analyze, audit, or improve their Agent ## Rules -- **Read-only**: never modify skill files. Only output report. -- **All 8 dimensions**: do not skip any. If data is insufficient, report "N/A — insufficient session data" rather than omitting. -- **Quantify**: "you had 12 research tasks last week but the skill never triggered" beats "you often do research". -- **Suggest, don't prescribe**: give specific wording suggestions for description improvements, but frame as suggestions. 
-- **Show evidence**: for undertrigger claims, quote the actual user message that should have triggered the skill. -- **Evidence-based suggestions**: when suggesting description rewrites, cite the specific research finding that motivates the change (e.g., "front-load trigger keywords — MCP study shows 3.6x selection rate improvement"). +- Audit all 8 dimensions. If data is missing, report `N/A — insufficient session data` instead of inventing a number. +- Default to read-only audit mode, but if the user explicitly asks to improve the skill or skill repository after the audit, you may apply the agreed fixes. +- Prefer real session transcripts. When auditing a source repository or docs-first skill collection with sparse invocation history, fall back to validator output, example prompts, validation logs, CI, and review checklists. Label those conclusions as lower confidence. +- For Codex, skill loading is not the same as skill invocation. Look for workflow markers, explicit prompt/result evidence, or repo-owned validation records before claiming the skill was actually used. +- Calibrate severity against repository maturity and public claims. Missing live routing evidence is a true P0 only when the repository claims the skills are routing-proven, production-ready, or already validated in real agent use. For honest docs-first, draft, or beta repositories, report it as the highest-priority P1 or next-milestone evidence gap instead. +- Distinguish a true workflow leak from a short action anchor in the description. Brief action language is acceptable when it separates adjacent skills; flag only step-by-step execution detail or output-format leakage in frontmatter. +- Use the native shell for the host OS. Prefer PowerShell on Windows and a POSIX shell elsewhere; do not require Bash if it is unavailable. +- Quantify with evidence whenever possible, and cite the research basis for rewrite suggestions. 
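+The severity-calibration rule above can be sketched as a small decision helper. The boolean inputs are hypothetical flags an auditor would set by hand, not part of any real interface:

```python
def routing_gap_severity(claims_routing_proven: bool,
                         release_gate_needs_routing_proof: bool,
                         honest_draft_stage: bool) -> str:
    """Sketch of the calibration rule: missing live routing evidence is a
    true P0 only when public claims or a release gate depend on that proof."""
    if claims_routing_proven or release_gate_needs_routing_proof:
        return "P0"  # public claims contradicted by the evidence set
    if honest_draft_stage:
        return "P1"  # next-milestone evidence work for docs-first/draft/beta
    return "P2"      # no routing claims at all: optional, static cleanup first
```

A repository that honestly labels itself beta therefore receives a P1, not a P0, for a missing routing transcript.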
## Overview -Analyze skills using **historical session data + static quality checks**, output a diagnostic report with P0/P1/P2 prioritized fixes. Scores each skill on a 5-point composite scale across 8 dimensions. +Analyze skills with **session data + static quality checks** and output a prioritized report with P0/P1/P2 fixes. This skill supports both installed-skill audits and source-repository audits for public skill collections that are still building usage evidence. -CSO (Claude/Agent Search Optimization) = writing skill descriptions so agents select the right skill at the right time. This skill checks for CSO violations. +## Quick Start -## Usage +1. Read `references/session-analysis.md` and collect the strongest available evidence set. +2. Read `references/audit-framework.md` and run all 8 dimensions. +3. Read `references/report-template.md` and deliver the report in that shape. -- `/optimize-skill` → scan all skills -- `/optimize-skill my-skill` → single skill -- `/optimize-skill skill-a skill-b` → multiple specified skills +## Output Format -## Data Sources +When answering: -Auto-detect the current agent platform and scan the corresponding paths: +1. state the scope, evidence set, and confidence level +2. show an overview table with `N/A` where data is missing +3. list P0, P1, and P2 findings +4. provide all 8 dimensions for each audited skill +5. separate findings about the audited skills from findings about the optimizer itself if both are in scope +6. if relevant, distinguish blockers for the repository's current milestone from evidence work needed for the next maturity step -| Source | Claude Code | Codex | Shared | -|--------|------------|-------|--------| -| Session transcripts | `~/.claude/projects/**/*.jsonl` | `~/.codex/sessions/**/*.jsonl` | — | -| Skill files | `~/.claude/skills/*/SKILL.md` | `~/.codex/skills/*/SKILL.md` | `~/.agents/skills/*/SKILL.md` | +## Resources -**Platform detection:** Check which directories exist. 
Scan all available sources — a user may have both Claude Code and Codex installed. - -## Workflow - -``` -Identify target skills - ↓ -Collect session data (python3 scripts scan JSONL transcripts) - ↓ -Run 8 analysis dimensions - ↓ -Compute composite scores - ↓ -Output report with P0/P1/P2 -``` - -### Step 1: Identify Target Skills - -Scan skill directories in order: `~/.claude/skills/`, `~/.codex/skills/`, `~/.agents/skills/`. Deduplicate by skill name (same name in multiple locations = same skill). For each, read `SKILL.md` and extract: -- name, description (from YAML frontmatter) -- trigger keywords (from description field) -- defined workflow steps (Step 1/2/3... or ### sections under Workflow) -- word count - -If user specified skill names, filter to only those. - -### Step 2: Collect Session Data - -Use python3 scripts via Bash to scan session JSONL files. Extract: - -**Claude Code sessions** (`~/.claude/projects/**/*.jsonl`): -- `Skill` tool_use calls (which skills were invoked) -- User messages (full text) -- Assistant messages after skill invocation (for workflow tracking) -- User messages after skill invocation (for reaction analysis) - -**Codex sessions** (`~/.codex/sessions/**/*.jsonl`): -- `session_meta` events → extract `base_instructions` for skill loading evidence -- `response_item` events → assistant outputs (workflow tracking) -- `event_msg` events → tool execution and skill-related events -- User messages from `turn_context` events (for reaction analysis) - -**Note:** Codex injects skills via context rather than explicit `Skill` tool calls. Skill loading (present in `base_instructions`) does NOT equal active invocation. To detect actual use, search for skill-specific workflow markers (step headers, output formats) in `response_item` content within that session. A skill is "invoked" only if the agent produced output following the skill's defined workflow. 
- -**Aggregated:** -- Per-skill: invocation count, trigger keyword match count -- Per-skill: user reaction sentiment after invocation -- Per-skill: workflow step completion markers - -### Step 3: Run 8 Analysis Dimensions - -**You MUST run ALL 8 dimensions.** The baseline behavior without this skill is to skip dimensions 4.2, 4.3, 4.5b, and 4.8. These are the most valuable dimensions — do not skip them. - -#### 4.1 Trigger Rate - -Count how many times each skill was actually invoked vs how many times its trigger keywords appeared in user messages. - -**Claude Code:** count `Skill` tool_use calls in transcripts. -**Codex:** count sessions where the agent produced output following the skill's workflow markers (not merely loaded in context). - -**Diagnose:** -- Never triggered → skill may be useless or trigger words wrong -- Keywords match >> actual invocations → undertrigger problem, description needs work -- High frequency → core skill, worth optimizing - -#### 4.2 Post-Invocation User Reaction - -**This dimension is critical and easy to skip. Do not skip it.** - -After a skill is invoked in a session, read the user's next 3 messages. Classify: -- **Negative**: "no", "wrong", "never mind", "not what I wanted", user interrupts -- **Correction**: user re-describes their intent, manually overrides skill output -- **Positive**: "good", "ok", "continue", "nice", user follows the workflow -- **Silent switch**: user changes topic entirely (likely false positive trigger) - -Report per-skill satisfaction rate. - -#### 4.3 Workflow Completion Rate - -**This dimension is critical and easy to skip. Do not skip it.** - -For each skill invocation found in session data: -1. Extract the skill's defined steps from SKILL.md -2. Search the assistant messages in that session for step markers (Step N, specific output formats defined in the skill) -3. Calculate: how far did execution get? 
- -Report: `{skill-name} (N steps): avg completed Step X/N (Y%)` - -If a specific step is frequently where execution stops, flag it. - -#### 4.4 Static Quality Analysis - -Check each SKILL.md against these 14 rules: - -| Check | Pass Criteria | -|-------|--------------| -| Frontmatter format | Only `name` + `description`, total < 1024 chars | -| Name format | Letters, numbers, hyphens only | -| Description trigger | Starts with "Use when..." or has explicit trigger conditions | -| Description workflow leak | Description does NOT summarize the skill's workflow steps (CSO violation) | -| Description pushiness | Description actively claims scenarios where it should be used, not just passive | -| Overview section | Present | -| Rules section | Present | -| MUST/NEVER density | Count ALL-CAPS directive words; >5 per 100 words = flag. Note: Meincke et al. (2025) found persuasion directives have inconsistent effects across models. Suggest converting to concrete bright-line rules with rationale, not mere emphasis. | -| Word count | < 500 words (flag if over) | -| Narrative anti-pattern | No "In session X, we found..." storytelling — skills should be instructions, not post-hoc reports | -| YAML quoting safety | description containing `: ` must be wrapped in double quotes, otherwise YAML parse failure makes skill invisible | -| Critical info position | Core trigger conditions and primary actions must be in the first 20% of SKILL.md, not buried in the middle (Lost in the Middle, Liu et al. TACL 2024: U-shaped attention curve) | -| Description 250-char check | Primary trigger keywords must appear within the first 250 characters of description (skill listing truncation point in most agents) | -| Trigger condition count | ≤ 2 trigger conditions in description is ideal; consistent with IFEval (Zhou et al. 2023) finding that LLMs struggle with multi-constraint prompts | - -#### 4.5a False Positive Rate (Overtrigger) - -Skill was invoked but user immediately rejected or ignored it. 
- -#### 4.5b Undertrigger Detection - -**This is the highest-value dimension.** Memento-Skills (arXiv:2603.18743) demonstrates that skills stored as structured files require accurate retrieval/routing to be effective — skills that are never retrieved cannot improve through their read-write learning loop, making undertriggering a compounding problem. - -For each skill, extract its **capability keywords** (not just trigger keywords — what the skill CAN do). Then scan user messages for tasks that match those capabilities but where the skill was NOT invoked. - -Example: user says "run these tasks in parallel" but parallel-runner was not triggered → undertrigger. - -Report: which user messages SHOULD have triggered the skill but didn't, and suggest description improvements. - -**Compounding Risk Assessment:** -For skills with chronic undertriggering (0 triggers across 5+ sessions where relevant tasks appeared), flag as "compounding risk" — undertriggered skills cannot self-improve through usage feedback, causing the gap to widen over time. Recommend immediate description rewrite as P0. - -#### 4.6 Cross-Skill Conflicts - -Compare all skill pairs: -- Trigger keyword overlap (same keywords in two descriptions) -- Workflow overlap (two skills teach similar processes) -- Contradictory guidance - -#### 4.7 Environment Consistency - -For each skill, extract referenced: -- File paths → check if they exist (`test -e`) -- CLI tools → check if installed (`which`) -- Directories → check if they exist - -Flag any broken references. - -#### 4.8 Token Economics - -**This dimension is critical and easy to skip. 
Do not skip it.** - -For each skill: -- Word count (from Step 1) -- Trigger frequency (from 4.1) -- Cost-effectiveness = trigger count / word count -- Flag: large + never-triggered skills as candidates for removal or compression - -**Progressive Disclosure Tier Check:** -Evaluate each skill against the 3-tier loading model (Agent Skills spec): -- Tier 1 (frontmatter): ~100 tokens. Check: is description ≤ 1024 chars? -- Tier 2 (SKILL.md body): <500 lines recommended. Check: word count. -- Tier 3 (reference files): loaded on demand. Check: does skill use reference files for detailed content, or cram everything into SKILL.md? - -Flag skills that put 500+ words in SKILL.md without using reference files as "poor progressive disclosure". - -### Step 4: Composite Score - -Rate each skill on a 5-point scale: - -| Score | Meaning | -|-------|---------| -| 5 | Healthy: high trigger rate, positive reactions, complete workflows, clean static | -| 4 | Good: minor issues in 1-2 dimensions | -| 3 | Needs attention: significant gap in 1 dimension or minor gaps in 3+ | -| 2 | Problematic: never triggered, or negative user reactions, or major static issues | -| 1 | Broken: doesn't work, references missing, or fundamentally misaligned | - -**Scored dimensions** (weighted average): -- Trigger rate: 25% -- User reaction: 20% -- Workflow completion: 15% -- Static quality: 15% -- Undertrigger: 15% -- Token economics: 10% - -**Qualitative dimensions** (reported but not scored — no reliable numeric metric): -- 4.5a Overtrigger: reported as count + examples -- 4.6 Cross-Skill Conflicts: reported as conflict pairs -- 4.7 Environment Consistency: reported as pass/fail per reference - -(If a scored dimension has no data — e.g., skill was never invoked so no user reaction — mark as "N/A" and redistribute weight.) 
- -## Report Format - -```markdown -# Skill Optimization Report -**Date**: {date} -**Scope**: {all / specified skills} -**Session data**: {N} sessions, {date range} - -## Overview -| Skill | Triggers | Reaction | Completion | Static | Undertrigger | Token | Score | -|-------|----------|----------|------------|--------|--------------|-------|-------| -| example-skill | 2 | 100% | 86% | B+ | 1 miss | 486w | 4/5 | - -## P0 Fixes (blocking usage) -1. ... - -## P1 Improvements (better experience) -1. ... - -## P2 Optional Optimizations -1. ... - -## Per-Skill Diagnostics -### {skill-name} -#### 4.1 Trigger Rate -... -#### 4.2 User Reaction -... -(all 8 dimensions) -``` - -## Research Background -The analysis dimensions in this report are grounded in the following research: -- **Undertrigger detection**: Memento-Skills (arXiv:2603.18743) — skills as structured files require accurate routing; unrouted skills cannot self-improve via the read-write learning loop -- **Description quality**: MCP Description Quality (arXiv:2602.18914) — well-written descriptions achieve 72% tool selection rate vs. 20% random baseline (3.6x improvement) -- **Information position**: Lost in the Middle (Liu et al., TACL 2024) — U-shaped LLM attention curve -- **Format impact**: He et al. 
(arXiv:2411.10541) — format changes alone can cause 9-40% performance variance -- **Instruction compliance**: IFEval (arXiv:2311.07911) — LLMs struggle with multi-constraint prompts +- `references/session-analysis.md`: installed-skill mode, source-repo mode, and platform-specific evidence collection +- `references/audit-framework.md`: the 8 dimensions, scoring rules, and static-check rubric +- `references/report-template.md`: report shape and wording guidance +- `references/repo-mode-example.md`: example reasoning for a docs-first public skill repository with sparse routing telemetry diff --git a/skills/skill-optimizer/references/audit-framework.md b/skills/skill-optimizer/references/audit-framework.md new file mode 100644 index 0000000..94f3b84 --- /dev/null +++ b/skills/skill-optimizer/references/audit-framework.md @@ -0,0 +1,115 @@ +# Audit Framework + +Run all 8 dimensions every time. + +If a dimension lacks enough evidence, report `N/A — insufficient session data` or `low confidence — inferred from source-repo evidence`. + +## 4.1 Trigger Rate + +- Compare actual invocations with relevant user-task opportunities. +- Never treat Codex skill loading by itself as invocation. + +## 4.2 Post-Invocation User Reaction + +- Read the next few user turns after invocation. +- Classify positive, correction, negative, or silent switch. + +## 4.3 Workflow Completion Rate + +- Extract the skill's intended steps from `SKILL.md`. +- Check whether the assistant output completed those steps, or whether validation logs show the workflow reaching the expected end state. + +## 4.4 Static Quality Analysis + +Check at least these areas: + +| Check | Pass Criteria | +|-------|---------------| +| Frontmatter format | Only `name` + `description`, total < 1024 chars | +| Name format | Letters, numbers, hyphens only | +| Description trigger | Starts with "Use when..." 
or contains explicit trigger conditions | +| Description workflow leak | Do not front-load step-by-step workflow or output-format detail in frontmatter | +| Description disambiguation | Short action anchors are acceptable when they separate adjacent skills | +| Description pushiness | Claims the right use cases instead of staying purely passive | +| Overview section | Present | +| Rules section | Present | +| Word count | Flag if the body is unnecessarily large | +| Narrative anti-pattern | Avoid post-hoc storytelling inside the core skill | +| YAML quoting safety | Quote descriptions that contain `: ` | +| Critical info position | Core trigger and boundary info appears early | +| Description 250-char check | Primary routing clues appear before common truncation points | +| Progressive disclosure | Detailed content lives in `references/` when possible | + +## 4.5a Overtrigger + +- Look for invocations that were immediately rejected, corrected, or abandoned. + +## 4.5b Undertrigger + +- Look for user tasks that match the skill's capability but did not result in invocation. +- In source-repo mode, treat this as low confidence unless you have direct routing evidence. + +## 4.6 Cross-Skill Conflicts + +Compare adjacent skills for: + +- overlapping trigger vocabulary +- overlapping workflow territory +- contradictory boundaries or guidance + +## 4.7 Environment Consistency + +Check referenced: + +- file paths +- directories +- CLI tools +- repo-owned docs or examples + +Document intentional host-side prerequisites separately from broken references. + +## 4.8 Token Economics + +Consider: + +- description length +- body length +- progressive disclosure quality +- trigger frequency when available + +Large skills without references or usage evidence are candidates for compression. 
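+A minimal sketch of these signals, with the 500- and 300-word thresholds chosen purely for illustration rather than taken from the scoring spec:

```python
from typing import Optional

def token_economics_flag(body_words: int, uses_references: bool,
                         trigger_count: Optional[int]) -> str:
    """Heuristic token-economics signal for one skill. trigger_count is
    None when no usage evidence is available (source-repo mode)."""
    if body_words > 500 and not uses_references:
        return "poor progressive disclosure"   # everything crammed into SKILL.md
    if trigger_count == 0 and body_words > 300:
        return "compression candidate"         # large but never triggered
    if trigger_count is None:
        return "low confidence: no usage evidence"
    return "ok"
```

Only the first two outcomes are findings; the third is an honest `N/A`-style label, consistent with the rest of the framework.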
+ +## Composite Score + +Use a 5-point scale: + +| Score | Meaning | +|-------|---------| +| 5 | Strong live routing evidence, healthy outcomes, clean static quality | +| 4 | Good static quality and workflow evidence, but some gaps remain | +| 3 | Meaningful concerns or missing proof in multiple dimensions | +| 2 | Major routing, quality, or environment problems | +| 1 | Broken or fundamentally misaligned | + +Redistribute weights if a scored dimension is `N/A`. + +## Severity Calibration + +Do not treat every evidence gap as the same class of problem. + +Use `P0` when: + +- the repository makes claims that are contradicted by the evidence +- a skill is broken, misleading, or unsafe to use +- the current release or milestone explicitly depends on live routing proof + +Use `P1` when: + +- the repository is honest that it is draft, beta, docs-first, or still collecting validation evidence +- the missing item is the next most valuable proof step, but not a contradiction of current public claims +- the issue weakens confidence more than it breaks current usability + +Use `P2` when: + +- the issue is mostly polish, wording, or future-proofing +- fixing it would improve routing odds but is not the main confidence bottleneck diff --git a/skills/skill-optimizer/references/repo-mode-example.md b/skills/skill-optimizer/references/repo-mode-example.md new file mode 100644 index 0000000..cd63866 --- /dev/null +++ b/skills/skill-optimizer/references/repo-mode-example.md @@ -0,0 +1,35 @@ +# Repo-Mode Example + +This is a good pattern for auditing a public skill repository that is still maturing. 
+ +## Example Scenario + +- the repository has 4 public skills +- each skill has a concise `SKILL.md`, `references/`, and example prompts +- CI and local validators pass +- there are validation logs and forward tests +- most historical sessions are maintainer sessions about building the repo, not clean installed-skill consumption +- the repository describes itself as docs-first, draft, or beta + +## Correct Audit Conclusion + +- static quality can still score highly +- workflow completion can often be judged from validation logs +- trigger rate, user reaction, and undertrigger may need to be `N/A` or low confidence +- missing live routing evidence is usually the top `P1` or "next-milestone evidence work", not automatically `P0` + +## When It Becomes P0 + +Upgrade that same gap to `P0` only if one of these is true: + +- the repository claims the skills are already routing-proven in real agent use +- the release gate explicitly requires transcript-backed routing evidence +- users would be misled into assuming the routing quality is already proven + +## Recommended Next Step + +Add one small routing-eval pack instead of building a huge telemetry system first: + +1. one installed-skill transcript or replayable eval prompt per public skill +2. one nearby non-use prompt for each adjacent skill pair +3. one short note separating "repo-authoring evidence" from "installed-skill usage evidence" diff --git a/skills/skill-optimizer/references/report-template.md b/skills/skill-optimizer/references/report-template.md new file mode 100644 index 0000000..6f1f2d7 --- /dev/null +++ b/skills/skill-optimizer/references/report-template.md @@ -0,0 +1,58 @@ +# Report Template + +Use this shape unless the user asks for something else. 

```markdown
# Skill Optimization Report
**Date**: {date}
**Scope**: {all / specified skills}
**Evidence**: {session count, validation logs, CI, validators, etc.}
**Confidence**: {high / medium / low}
**Release stage**: {production / public beta / docs-first / draft}

## Overview
| Skill | Trigger | Reaction | Completion | Static | Undertrigger | Token | Score |
|-------|---------|----------|------------|--------|--------------|-------|-------|
| example-skill | 2/7 | 100% | 86% | strong | 1 miss | 320w | 4/5 |

## P0 Fixes
1. ...

## P1 Improvements
1. ...

## P2 Optional Optimizations
1. ...

## Milestone Fit
- current-milestone blockers: ...
- next-milestone evidence work: ...

## Per-Skill Diagnostics
### {skill-name}
#### 4.1 Trigger Rate
...
#### 4.2 User Reaction
...
#### 4.3 Workflow Completion
...
#### 4.4 Static Quality
...
#### 4.5a Overtrigger
...
#### 4.5b Undertrigger
...
#### 4.6 Cross-Skill Conflicts
...
#### 4.7 Environment Consistency
...
#### 4.8 Token Economics
...
```

Keep the report honest:

- `N/A` is better than a fake metric.
- Call out the confidence level whenever the evidence is mostly static or curated.
- Quote or summarize the exact user task when claiming undertrigger.
- If the repository is still in a draft or beta stage, say whether a missing proof item blocks the current milestone or only the next maturity step.

diff --git a/skills/skill-optimizer/references/session-analysis.md b/skills/skill-optimizer/references/session-analysis.md
new file mode 100644
index 0000000..1500050
--- /dev/null
+++ b/skills/skill-optimizer/references/session-analysis.md
@@ -0,0 +1,55 @@

# Session Analysis

Use the strongest evidence available, and say what kind of evidence you used.

## Installed-Skill Mode

Use this when the skills are already installed under an agent skill directory.

Scan these locations if they exist:

| Source | Claude Code | Codex | Shared |
|--------|-------------|-------|--------|
| Skill files | `~/.claude/skills/*/SKILL.md` | `~/.codex/skills/*/SKILL.md` | `~/.agents/skills/*/SKILL.md` |
| Session transcripts | `~/.claude/projects/**/*.jsonl` | `~/.codex/sessions/**/*.jsonl` | — |

Notes:

- A user may have more than one platform installed. Scan every relevant location and deduplicate by skill name.
- For Claude Code, explicit `Skill` tool use is strong invocation evidence.
- For Codex, skill loading in `base_instructions` is not enough. Look for workflow markers, structured output sections, or explicit prompt/result evidence that the skill actually shaped the answer.

## Source-Repository Mode

Use this when the skills live in a repository that is being prepared, validated, or published but does not yet have much live routing history.

Prefer these sources:

- repository `skills/` folders and their `SKILL.md` frontmatter
- `references/` files, `agents/openai.yaml`, and repo-owned validators
- validation logs, review checklists, CI workflows, and recorded forward tests
- local session transcripts that show maintainers working on the repository

Important:

- Maintainer sessions about authoring the repository are not the same as end-user skill consumption.
- If the only evidence is authoring or curated validation, mark trigger rate, user reaction, and undertrigger findings as low confidence or `N/A`.
- Still run all 8 dimensions. The goal is to expose evidence gaps, not to pretend they do not exist.
- Judge severity in context. A docs-first beta repo that openly says its skills are still being validated does not have the same routing-evidence burden as a repo claiming production-ready agent routing.

## Shell Guidance

- On Windows, prefer PowerShell-native commands.
- On macOS/Linux, prefer the native POSIX shell.
- Do not require Bash-specific syntax when the environment does not provide Bash.

## Evidence Ranking

Prefer evidence in this order:

1. direct invocation evidence from session transcripts
2. prompt/result validation logs and forward-test records
3. repository validators, CI, and static structure
4. inferred risk from wording overlap or missing boundaries

Lower-ranked evidence can support a claim, but should not be presented as if it were direct usage telemetry.
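The ranking above can be sketched as a small confidence gate: the best-ranked source supporting a claim caps how confidently that claim may be reported. The source labels and tier names below are hypothetical illustrations, not part of the optimizer:

```python
# Hypothetical sketch: the strongest (lowest-ranked) evidence source
# behind a finding decides its reporting tier. Labels are illustrative.

EVIDENCE_RANK = {
    "session_transcript_invocation": 1,  # direct usage evidence
    "validation_log_or_forward_test": 2, # prompt/result records
    "repo_validator_or_ci": 3,           # static structure checks
    "wording_overlap_inference": 4,      # inferred risk only
}

def claim_confidence(sources: list[str]) -> str:
    """Map the strongest supporting source to a reporting tier."""
    if not sources:
        return "no-evidence"
    best = min(EVIDENCE_RANK[s] for s in sources)
    return {1: "high", 2: "medium", 3: "low", 4: "speculative"}[best]

# CI plus a forward-test record supports at most a medium-tier claim:
print(claim_confidence(["repo_validator_or_ci", "validation_log_or_forward_test"]))  # medium
```

In this sketch, adding more low-ranked sources never raises a claim into the tier reserved for direct transcript evidence, which matches the rule that corroboration is not telemetry.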