Skill quality dashboard: path adherence scoring, token tracking, and CI pipeline #1102
…eline step
- Add generate-quality-report.js: parses agent-metadata traces, computes path adherence scores, generates skill-quality-report.json
- Update agent-runner.ts: capture token usage, LLM call metadata, and tool call formatting for both skill: and tool: code blocks
- Add quality-report npm script to package.json
- Add 'Generate quality report' step to GitHub Actions workflow
- Define EXPECTED_PATHS for 10 skill areas based on real passing traces

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
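For readers unfamiliar with the idea, path adherence scoring could look roughly like the sketch below. The scoring rule (fraction of expected steps found in order in the actual trace) and the contents of EXPECTED_PATHS here are assumptions for illustration; the real entries and formula in generate-quality-report.js are not shown in this conversation.

```typescript
// Hypothetical expected path for one area; the real EXPECTED_PATHS entries
// are defined in generate-quality-report.js and are not reproduced here.
const EXPECTED_PATHS: Record<string, string[]> = {
  "example-area": ["skill:example", "tool:lookup", "tool:apply"],
};

// Assumed scoring rule: the fraction of expected steps that appear in the
// actual trace in the expected order (an in-order subsequence match).
function pathAdherence(expected: string[], actual: string[]): number | null {
  if (expected.length === 0) return null; // nothing to compare against
  let cursor = 0;
  let matched = 0;
  for (const step of expected) {
    const index = actual.indexOf(step, cursor);
    if (index === -1) continue; // step missing: count nothing, keep the cursor
    matched += 1;
    cursor = index + 1; // later steps must appear after this one
  }
  return matched / expected.length;
}

// Usage: score an observed trace against the hypothetical expected path.
const observed = ["skill:example", "tool:other", "tool:apply"];
const exampleScore = pathAdherence(EXPECTED_PATHS["example-area"] ?? [], observed);
```

Returning null for an empty expected path keeps "no expectation defined" distinct from "scored zero", which a dashboard would probably want to render differently.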
The test used bare __dirname which is not available in ESM modules. Added fileURLToPath(import.meta.url) polyfill matching the pattern used in jest.setup.ts and other test files. Fixes 6 test failures: Workflow Documentation, Command Validation, and References Pattern tests. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
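A minimal sketch of the polyfill pattern this commit describes, written as a helper so it can be exercised outside an ES module. The function name dirnameFromModuleUrl is hypothetical; in a test file the argument would be import.meta.url.

```typescript
import { dirname } from "node:path";
import { fileURLToPath } from "node:url";

// Derive a module's directory from its file URL.
// In an ES module, call this with import.meta.url to get what
// __dirname would have been under CommonJS.
function dirnameFromModuleUrl(moduleUrl: string): string {
  return dirname(fileURLToPath(moduleUrl));
}
```

In a test file this would typically be used once at the top, for example by assigning the result to a local constant and building fixture paths from it.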
- Add eslint-disable for unused utility functions (loadPerTestTokenUsage, extractTestCase)
- Use const instead of let for match variable
- Use double quotes instead of single quotes
- Remove unused testRunPath parameter from buildAreaSummaries

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace existsSync+readFileSync pattern with try/catch readFileSync to eliminate time-of-check-time-of-use file system race condition flagged by CodeQL as high severity. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
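The change described above can be sketched as follows. The helper name readJsonOrNull and the choice to return null for a missing file are assumptions, not necessarily what the script does.

```typescript
import { readFileSync } from "node:fs";

// Read and parse a JSON file in one step; a missing file yields null.
// There is no separate existsSync() check, so the file cannot disappear
// between a check and the read.
function readJsonOrNull(filePath: string): unknown {
  try {
    return JSON.parse(readFileSync(filePath, "utf8"));
  } catch (error) {
    if ((error as { code?: string }).code === "ENOENT") return null;
    throw error; // other failures (permissions, invalid JSON) still surface
  }
}
```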
- Remove loadPerTestTokenUsage and extractTestCase functions from generate-quality-report.js (CodeQL unused-function alerts)
- Remove totalCost and per-call cost fields from TokenUsage interface and all cost calculations in agent-runner.ts (not used in reports)
- Keep token count tracking (inputTokens, outputTokens, cache tokens)

Addresses all 3 CodeQL review comments from PR #1.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
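As an illustration of count-only token tracking, the sketch below sums usage across session events. inputTokens, outputTokens and the assistant.usage event type come from this PR; the cache field names and the event shape are assumptions.

```typescript
// Token counts only; cost fields are deliberately absent.
interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
  cacheReadTokens: number; // assumed field name
  cacheWriteTokens: number; // assumed field name
}

// Assumed event shape; the real SDK event types may differ.
interface SessionEvent {
  type: string;
  data: Partial<TokenUsage>;
}

// Sum token counts over all assistant.usage events, ignoring other events.
function sumTokenUsage(events: SessionEvent[]): TokenUsage {
  const total: TokenUsage = {
    inputTokens: 0,
    outputTokens: 0,
    cacheReadTokens: 0,
    cacheWriteTokens: 0,
  };
  for (const event of events) {
    if (event.type !== "assistant.usage") continue;
    total.inputTokens += event.data.inputTokens ?? 0;
    total.outputTokens += event.data.outputTokens ?? 0;
    total.cacheReadTokens += event.data.cacheReadTokens ?? 0;
    total.cacheWriteTokens += event.data.cacheWriteTokens ?? 0;
  }
  return total;
}
```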
Pull request overview
Adds an automated “skill quality” reporting layer on top of the existing integration test outputs so downstream systems (dashboard) can consume a single JSON contract with token usage, traces, and path-adherence metrics.
Changes:
- Added token usage capture + per-test/token-summary JSON emission in the agent runner.
- Added a new generate-quality-report.js script to consolidate JUnit + agent traces + token data into skill-quality-report.json.
- Updated CI workflow and test package scripts to generate the quality report artifact after integration tests.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| tests/utils/agent-runner.ts | Captures token usage from session events and writes per-test and consolidated token usage JSON. |
| tests/scripts/generate-quality-report.js | New post-processor that builds the dashboard contract JSON (areas, token usage, traces, path adherence). |
| tests/package.json | Adds quality-report npm script. |
| tests/microsoft-foundry/resource/create/integration.test.ts | Attempts to fix ESM __dirname usage for this test. |
| .github/workflows/test-all-integration.yml | Adds a post-test step to generate the quality report in CI artifacts. |
- Remove unused path/fileURLToPath imports from integration.test.ts (__dirname comes from jest.setup.ts global)
- Guard against -1 index in path adherence node lookup
- Wrap loadTokenSummary in try/catch for corrupted JSON resilience
- Switch token-summary from JSON to JSONL for safe concurrent writes
- Redact prompts in token-usage.json via redactSecrets()
- Initialize model default from modelOverride env var

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
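A rough sketch of the JSONL approach, assuming one record per line and a loader that tolerates bad lines. The function names appendRecord and loadRecords are hypothetical. Appending a line avoids the read-modify-write cycle of a single JSON document, which is where concurrent writers would otherwise overwrite each other.

```typescript
import { appendFileSync, readFileSync } from "node:fs";

// Append one record as a single line; no existing content is rewritten.
function appendRecord(filePath: string, record: unknown): void {
  appendFileSync(filePath, JSON.stringify(record) + "\n", "utf8");
}

// Load all records, skipping blank or corrupted lines so that one bad
// write does not discard the whole summary.
function loadRecords(filePath: string): unknown[] {
  let text: string;
  try {
    text = readFileSync(filePath, "utf8");
  } catch {
    return []; // missing or unreadable file: treat as no data
  }
  const records: unknown[] = [];
  for (const line of text.split("\n")) {
    if (line.trim() === "") continue;
    try {
      records.push(JSON.parse(line));
    } catch {
      // corrupted line: skip it and keep the rest
    }
  }
  return records;
}
```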
- Write agent-metadata.json for machine consumption (comment microsoft#7)
- Rename apiCall to llmCall consistently (comment microsoft#8)
- Quality report step only runs for microsoft-foundry (comment microsoft#9)
- Add comprehensive JSDoc to buildTraces function (comment microsoft#10)
- Fix passRate: simplify math, return null when no tests (comment microsoft#11)
- Remove duplicate isSkillInvoked/getToolCalls, re-export from evaluate.ts (comments microsoft#12, microsoft#13)
- All 6 Copilot reviewer comments already addressed in prior commits

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
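The passRate fix could be as small as the sketch below. Whether the report stores a fraction or a percentage is not stated here, so the fraction in [0, 1] is an assumption.

```typescript
// Returns passed / total as a fraction in [0, 1], or null when no tests ran,
// so an empty area is not reported as a 0% pass rate.
function passRate(passed: number, total: number): number | null {
  if (total <= 0) return null;
  return passed / total;
}
```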
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
Comments suppressed due to low confidence (1)
tests/utils/agent-runner.ts:205
- The comments inside these switch case blocks are not indented relative to the block opening brace. With the repo’s ESLint indent rule (2 spaces, SwitchCase: 1), this is likely to fail lint. Indent these comment lines to match the surrounding statements (same applies to the similar comment lines in the other case blocks nearby).
case "assistant.message_delta": {
// Accumulate deltas for streaming - we'll use the final message instead
const messageId = event.data.messageId as string;
const deltaContent = event.data.deltaContent as string;
if (messageId && deltaContent) {
messageDeltas[messageId] = (messageDeltas[messageId] || "") + deltaContent;
}
…ix event types
- Remove extractToolCallsFromMarkdown() entirely; agent-metadata.json is the only source
- Remove markdown fallback in extractToolCalls(); return empty if no JSON
- Apply redactSecrets() to full JSON text instead of just the prompt field
- Fix event type matching: use SDK's tool.execution_start with data.toolName
- Fix testedAreas using a.name instead of a.area (property name bug)
- Update JSDoc to remove stale markdown fallback reference

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
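A sketch of what JSON-only tool-call extraction might look like after this commit. The tool.execution_start event type and data.toolName field come from the commit message; the top-level events array and the function signature are assumptions about the agent-metadata.json layout.

```typescript
// Assumed layout: { "events": [ { "type": "...", "data": { "toolName": "..." } } ] }
interface TraceEvent {
  type?: unknown;
  data?: { toolName?: unknown };
}

// Return tool names in call order. With no markdown fallback, missing or
// malformed JSON simply yields an empty list.
function extractToolCalls(metadataJson: string | null): string[] {
  if (metadataJson === null) return [];
  let parsed: unknown;
  try {
    parsed = JSON.parse(metadataJson);
  } catch {
    return [];
  }
  if (typeof parsed !== "object" || parsed === null) return [];
  const events = (parsed as { events?: unknown }).events;
  if (!Array.isArray(events)) return [];
  const names: string[] = [];
  for (const event of events as Array<TraceEvent | null>) {
    const name = event?.data?.toolName;
    if (event?.type === "tool.execution_start" && typeof name === "string") {
      names.push(name);
    }
  }
  return names;
}
```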
Addressed all review comments from @JasonYeMSFT in the latest push (15558d9):
Net result: -77 lines (29 added, 106 removed).
@banibrata-de You also need to agree to the Contributor License Agreement.
@microsoft-github-policy-service agree company="Microsoft" |
Skill Quality Dashboard & Reporting Pipeline
What
Adds automated quality reporting infrastructure for Copilot skill integration tests: a skill-quality-report.json artifact and the generate-quality-report.js script that produces it. This enables the Copilot Skills Quality Dashboard to visualize skill health, path adherence, and token usage from CI artifacts.
Why
Today, integration tests only assert pass/fail. A test can pass while the agent takes a suboptimal path (e.g., answering from cached docs instead of calling the intended MCP tools). This creates blind spots in skill quality that are invisible until end-users report issues.
Path adherence scoring compares actual agent tool-call sequences against expected workflows derived from passing traces, catching:
- … view repeatedly without progressing
- … azure-quota in a quota skill)
Live Dashboard
https://foundry-e2etest.azurewebsites.net/api/copilot-skills
The dashboard renders from workflow artifacts automatically. It shows per-area and per-test metrics:
Changes
| File | Change |
|---|---|
| tests/scripts/generate-quality-report.js | … skill-quality-report.json with traces, path adherence, and recommendations |
| tests/utils/agent-runner.ts | … assistant.usage and session.shutdown events |
| tests/package.json | … quality-report script |
| .github/workflows/test-all-integration.yml | … |
| tests/microsoft-foundry/resource/create/integration.test.ts | … __dirname bug (6 tests were failing due to bare __dirname in ESM module) |
Quality Metrics (current state from local test data)
How It Works
- … test-all-integration.yml (nightly or manual dispatch)
- agent-runner.ts captures token usage and writes token-usage.json per test
- generate-quality-report.js produces skill-quality-report.json
- … integration-report-*) is uploaded with JUnit XML + quality report
Testing
tsc --noEmit)