90 changes: 90 additions & 0 deletions README.md
@@ -8,6 +8,8 @@ Works with **Claude Code**, **Codex**, and any agent supporting the Agent Skills

Most skill auditors only do static checks on your SKILL.md. This one also mines your actual session transcripts to measure trigger rates, user satisfaction, workflow completion, and undertrigger gaps — then scores each skill on a 5-point composite scale.

It also works in a **source-repository audit mode** for public skill collections that are still building usage evidence. In that mode it uses validators, example prompts, validation logs, CI, and maintainer sessions as secondary evidence, while clearly marking trigger-rate and reaction findings as low-confidence or `N/A` when live routing data is missing.

## What It Does

**6 scored dimensions** (weighted into composite score):
@@ -29,6 +31,24 @@ Most skill auditors only do static checks on your SKILL.md. This one also mines
| **Cross-Skill Conflicts** | Trigger keyword overlap and contradictory guidance between skills |
| **Environment Consistency** | Broken file paths, missing CLI tools, non-existent directories |

## Audit Modes

### Installed Skill Mode

Use this when the skills are already installed under `~/.claude/skills/`, `~/.codex/skills/`, or `~/.agents/skills/`. This is the highest-confidence mode because invocation evidence can come directly from session transcripts.

### Source Repository Mode

Use this when you are auditing a skill repository before or alongside publication. The optimizer still runs all 8 dimensions, but it treats repo validators, example prompts, validation logs, review checklists, and CI as fallback evidence, not as a substitute for live routing telemetry.

## When Routing Eval Is Necessary

Routing-eval or transcript evidence is not equally urgent for every repository.

- Treat it as `P0` when the repository claims the skills are already routing-proven, production-ready, or validated in real agent use.
- Treat it as `P1` or next-milestone evidence work when the repository is honestly positioned as docs-first, draft, or beta and already separates proven behavior from future validation goals.
- Treat it as optional only when the repository is not making routing claims at all and the audit goal is purely static cleanup.

## Installation

Copy the command below and paste it directly into your agent's chat — it will install automatically:
@@ -71,6 +91,15 @@ cp -r /tmp/skill-optimizer/skills/skill-optimizer ~/.agents/skills/
rm -rf /tmp/skill-optimizer
```

```powershell
# Windows PowerShell example for Codex
$target = Join-Path $env:TEMP 'skill-optimizer'
git clone https://github.com/hqhq1025/skill-optimizer.git $target
# PowerShell does not treat the backslash as an escape character, so single separators are enough
New-Item -ItemType Directory -Force -Path "$HOME\.codex\skills" | Out-Null
Copy-Item -Recurse -Force "$target\skills\skill-optimizer" "$HOME\.codex\skills\"
Remove-Item -Recurse -Force $target
```

</details>

## Usage
@@ -98,6 +127,15 @@ The optimizer auto-detects available platforms and scans session data from all o
| Codex | `~/.codex/skills/` | `~/.codex/sessions/**/*.jsonl` |
| Shared | `~/.agents/skills/` | — |

For Codex, skill loading in `base_instructions` is not enough to prove actual use. The optimizer looks for workflow markers or explicit prompt/result evidence before counting an invocation.
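The distinction between "loaded" and "invoked" can be sketched as follows. This is only an illustration under stated assumptions: the record shape (`kind` and `text` fields) and the marker string are hypothetical stand-ins, not the real Codex session schema or the optimizer's actual marker list.

```python
import json
from typing import Iterable, Iterator

# Hypothetical marker; real workflow markers depend on each skill's output.
WORKFLOW_MARKERS = ("# Skill Optimization Report",)


def parse_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield JSON objects from JSONL lines, skipping blank or malformed lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict):
            yield record


def shows_invocation(records: Iterable[dict], markers=WORKFLOW_MARKERS) -> bool:
    """Return True only if a record outside base_instructions contains a marker."""
    for record in records:
        # Assumed schema: loading a skill shows up as kind == "base_instructions".
        if record.get("kind") == "base_instructions":
            continue
        text = record.get("text", "")
        if isinstance(text, str) and any(marker in text for marker in markers):
            return True
    return False
```

A session in which the marker appears only inside the loaded instructions is not counted, which mirrors the rule stated above.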

When auditing a source repository instead of an installed skill directory, the optimizer can also use:

- repo-owned validators
- `references/` files and `agents/openai.yaml`
- example prompts and validation logs
- CI workflows and forward-test records

## Research Background

The analysis dimensions are grounded in peer-reviewed research:
@@ -125,6 +163,58 @@ Compute composite scores (weighted average of 6 scored dimensions)
Output report with P0/P1/P2 prioritized fixes
```

When session data is sparse, the optimizer still runs all 8 dimensions and explicitly marks any unsupported metrics as `N/A` instead of fabricating a score.
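One way to keep `N/A` out of the composite score is to average only over the dimensions that have evidence. The sketch below assumes the weights are renormalized over the supported dimensions; the README does not say how the optimizer folds `N/A` into the composite, so treat this as an illustration. The function and parameter names are hypothetical.

```python
from typing import Mapping, Optional


def composite_score(
    scores: Mapping[str, Optional[float]],
    weights: Mapping[str, float],
) -> Optional[float]:
    """Weighted average over supported dimensions; None stands for N/A."""
    supported = {
        dim: score
        for dim, score in scores.items()
        if score is not None and dim in weights
    }
    total_weight = sum(weights[dim] for dim in supported)
    if total_weight == 0:
        # No dimension has evidence: report N/A instead of inventing a number.
        return None
    return sum(weights[dim] * score for dim, score in supported.items()) / total_weight
```

For example, with `trigger` and `reaction` set to `None`, only the remaining dimensions contribute, and a skill with no supported dimension at all yields `None` rather than a score.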

## Example Repository Audit

Example: a docs-first miniapp skill repository contains 4 public skills. It passes validators and CI and has example prompts plus forward-test notes, but its only session data comes from maintainer sessions rather than clean installed-skill routing transcripts.

The correct audit result is:

- score static quality and progressive disclosure normally
- use validation logs as medium-confidence workflow evidence
- mark trigger rate, user reaction, and undertrigger as low-confidence or `N/A`
- report missing routing transcripts as the highest-priority `P1` for the next maturity step; escalate to `P0` only if the repo already claims routing proof

## Example Output

```markdown
# Skill Optimization Report
**Date**: 2026-03-30
**Scope**: all public skills in `miniprogram_skills`
**Evidence**: validator pass, CI, validation log, 8 maintainer sessions
**Confidence**: static=high, workflow=medium, routing=low
**Release stage**: docs-first public beta

## Overview
| Skill | Trigger | Reaction | Completion | Static | Undertrigger | Token | Score |
|-------|---------|----------|------------|--------|--------------|-------|-------|
| miniapp-devtools-cli-repair | N/A | N/A | strong | strong | N/A | strong | 4/5 |
| miniapp-devtools-gui-check | N/A | N/A | strong | strong | N/A | strong | 4/5 |

## P0 Fixes
None from the current evidence set.

## P1 Improvements
1. Add one installed-skill transcript or replayable routing eval per public skill.
2. Add negative-path validation for adjacent skill boundaries.

## Milestone Fit
- current-milestone blockers: none beyond already-declared beta limits
- next-milestone evidence work: transcript-backed routing proof

## Per-Skill Diagnostics
### miniapp-devtools-gui-check
#### 4.1 Trigger Rate
N/A — insufficient live routing evidence
#### 4.3 Workflow Completion
Strong. Validation logs show one narrow host-side route check reaching a real report.
#### 4.6 Cross-Skill Conflicts
Moderate but controlled. Primary overlap is with miniapp-devtools-cli-repair.
```

This is the intended behavior for source-repository mode: keep the report honest, keep all 8 dimensions, and avoid overstating routing quality when the available evidence is mostly static or curated.

**Scored dimensions (weighted average):**
- Trigger rate: 25%
- User reaction: 20%
90 changes: 90 additions & 0 deletions README.zh-CN.md
@@ -8,6 +8,8 @@

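Most skill audit tools only run static checks on SKILL.md. This tool also mines your real session records to quantify trigger rate, user satisfaction, workflow completion rate, and undertrigger gaps, then gives each skill a composite score on a 5-point scale.

It also supports a **source repository audit mode**: when a public skill repository is still accumulating real usage evidence, the audit can fall back to secondary evidence such as validators, example prompts, validation logs, CI, and maintainer sessions, while explicitly marking dimensions that lack real routing samples, such as trigger rate and user reaction, as low-confidence or `N/A`.

## What It Does

**6 scored dimensions** (weighted into the composite score):
@@ -29,6 +31,24 @@
| **Cross-Skill Conflicts** | Overlapping trigger keywords and contradictory guidance between skills |
| **Environment Consistency** | Broken file paths, missing CLI tools, non-existent directories |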

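## Audit Modes

### Installed Skill Mode

Prefer this mode when the target skills are already installed under `~/.claude/skills/`, `~/.codex/skills/`, or `~/.agents/skills/`. It has the highest confidence because real invocation evidence can be found directly in session transcripts.

### Source Repository Mode

Use this mode when you are auditing a skill repository that is about to be published and has not yet accumulated enough live routing data. The optimizer still runs all 8 dimensions, but it treats repo validators, example prompts, validation logs, review checklists, and CI as fallback evidence and does not pretend they are equivalent to real routing telemetry.

## When Routing Eval Is Required

Routing-eval / transcript evidence is not equally urgent for every repository.

- When the repository claims its skills are already routing-proven, production-ready, or validated in real agent use, this is a `P0`.
- When the repository explicitly positions itself as docs-first, draft, or beta, and already honestly separates "proven" from "to be validated", this fits better as a `P1` or as next-stage evidence work.
- When the audit goal is only static cleanup and the repository makes no routing-maturity commitments, this work does not need to be at the front of the queue.

## Installation

Copy the command below and paste it directly into your agent's chat to install automatically:
@@ -71,6 +91,15 @@ cp -r /tmp/skill-optimizer/skills/skill-optimizer ~/.agents/skills/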
rm -rf /tmp/skill-optimizer
```

```powershell
# Windows PowerShell example (Codex)
$target = Join-Path $env:TEMP 'skill-optimizer'
git clone https://github.com/hqhq1025/skill-optimizer.git $target
New-Item -ItemType Directory -Force -Path "$HOME\.codex\skills" | Out-Null
Copy-Item -Recurse -Force "$target\skills\skill-optimizer" "$HOME\.codex\skills\"
Remove-Item -Recurse -Force $target
```

</details>

## Usage
@@ -98,6 +127,15 @@ rm -rf /tmp/skill-optimizer
| Codex | `~/.codex/skills/` | `~/.codex/sessions/**/*.jsonl` |
| Shared | `~/.agents/skills/` | — |

For Codex, a skill appearing as loaded in `base_instructions` does not mean the skill was actually invoked. The optimizer keeps looking for workflow markers or explicit prompt/result evidence before counting it as an invocation.

When auditing a source repository rather than skills already installed in a local directory, the optimizer can also read:

- repo-owned validators
- `references/` files and `agents/openai.yaml`
- example prompts and validation logs
- CI workflows and forward-test records

## Research Background

The analysis dimensions are based on peer-reviewed academic research:
@@ -125,6 +163,58 @@ rm -rf /tmp/skill-optimizer
Output report with P0/P1/P2 prioritized fixes
```

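When session data is insufficient, the optimizer still runs all 8 dimensions and explicitly marks metrics without enough evidence as `N/A` instead of fabricating a score.

## Example Repository Audit

Example: a docs-first miniapp skill repository has 4 public skills. Validators and CI both pass, and there are example prompts, validation logs, and forward-test records, but the session history is mostly maintainers building the repository rather than installed skills being reliably routed in real conversations.

In this case the correct audit conclusion is:

- Static quality and progressive disclosure can be scored normally
- Workflow completion can use validation logs as medium-confidence evidence
- Trigger rate, user reaction, and undertrigger should be marked low-confidence or `N/A`
- Missing routing transcripts should be listed as the most important `P1` for the next stage rather than mechanically rated `P0`, unless the repository already publicly claims real routing proof

## Example Output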

```markdown
# Skill Optimization Report
**Date**: 2026-03-30
**Scope**: all public skills in `miniprogram_skills`
**Evidence**: validator pass, CI, validation log, 8 maintainer sessions
**Confidence**: static=high, workflow=medium, routing=low
**Release stage**: docs-first public beta

## Overview
| Skill | Trigger | Reaction | Completion | Static | Undertrigger | Token | Score |
|-------|---------|----------|------------|--------|--------------|-------|-------|
| miniapp-devtools-cli-repair | N/A | N/A | strong | strong | N/A | strong | 4/5 |
| miniapp-devtools-gui-check | N/A | N/A | strong | strong | N/A | strong | 4/5 |

## P0 Fixes
No P0 under the current evidence set.

## P1 Improvements
1. Add 1 installed-skill transcript or replayable routing eval for each public skill.
2. Add negative-path validation for adjacent skill boundaries.

## Milestone Fit
- current-milestone blockers: no new blockers beyond the beta limits the repository already declares
- next-milestone evidence work: transcript-backed routing proof

## Per-Skill Diagnostics
### miniapp-devtools-gui-check
#### 4.1 Trigger Rate
N/A: not enough live routing evidence
#### 4.3 Workflow Completion
Strong. The validation log shows one real narrow-route host-side check that produced a report.
#### 4.6 Cross-Skill Conflicts
Moderate but controlled. The main overlap is with miniapp-devtools-cli-repair.
```

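This is the output style expected in source-repository mode: keep all 8 dimensions, label evidence levels honestly, and do not fabricate a confident-looking conclusion just because live routing data is missing.

**Scored dimensions (weighted average):**
- Trigger rate: 25%
- User reaction: 20%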