Skill quality dashboard: path adherence scoring, token tracking, and CI pipeline by banibrata-de · Pull Request #1102 · microsoft/GitHub-Copilot-for-Azure

banibrata-de · 2026-03-03T00:10:41Z

Skill Quality Dashboard & Reporting Pipeline

What

Adds automated quality reporting infrastructure for Copilot skill integration tests: a skill-quality-report.json artifact and the generate-quality-report.js script that produces it. This enables the Copilot Skills Quality Dashboard to visualize skill health, path adherence, and token usage from CI artifacts.

Why

Today, integration tests only assert pass/fail. A test can pass while the agent takes a suboptimal path (e.g., answering from cached docs instead of calling the intended MCP tools). This creates blind spots in skill quality that are invisible until end-users report issues.

Path adherence scoring compares actual agent tool-call sequences against expected workflows derived from passing traces, catching:

Skills that answer from documentation instead of querying live data
Routing loops where the agent calls view repeatedly without progressing
Missing MCP tool invocations (e.g., skipping azure-quota in a quota skill)
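
As an illustration of the idea above, here is a minimal sketch of one way such a comparison could be scored. The function names, the prefix-in-order scoring rule, and the repeat detector are assumptions made for this example; the PR description does not state the formula that generate-quality-report.js uses.

```typescript
// Hypothetical scoring rule (not taken from the PR): walk the actual tool calls
// and count how much of the expected path, starting from its first step,
// shows up in the same order. Extra calls in between are tolerated.
function pathAdherence(expected: string[], actual: string[]): number | null {
  if (expected.length === 0) return null; // nothing to compare against
  let matched = 0;
  for (const tool of actual) {
    if (matched < expected.length && tool === expected[matched]) {
      matched++;
    }
  }
  return matched / expected.length;
}

// Longest run of the same tool called back-to-back. A long run of "view"
// is the kind of routing loop the description mentions.
function longestRepeat(actual: string[]): { tool: string | null; count: number } {
  let best: { tool: string | null; count: number } = { tool: null, count: 0 };
  let prev: string | null = null;
  let run = 0;
  for (const tool of actual) {
    run = tool === prev ? run + 1 : 1;
    if (run > best.count) best = { tool, count: run };
    prev = tool;
  }
  return best;
}
```

Under this rule, a quota test that calls the skill but never reaches azure-quota scores below 1 even when the test itself passes, which is the blind spot the scoring is meant to expose.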

Live Dashboard

https://foundry-e2etest.azurewebsites.net/api/copilot-skills

The dashboard renders from workflow artifacts automatically. It shows per-area and per-test metrics:

Pass/fail status with reliability & consistency scores
Path adherence (expected vs actual tool sequences)
Token usage (input/output per test)
LLM call count
Duration
Interactive trace viewer with Mermaid flow diagrams

Changes

File	What
`tests/scripts/generate-quality-report.js`	New. Parses JUnit XML and agent-metadata markdown to produce `skill-quality-report.json` with traces, path adherence, and recommendations
`tests/utils/agent-runner.ts`	Token usage tracking (input/output/cache tokens, API call count, duration) extracted from `assistant.usage` and `session.shutdown` events
`tests/package.json`	Added `quality-report` script
`.github/workflows/test-all-integration.yml`	Added "Generate quality report" step that runs after tests and uploads results in the artifact
`tests/microsoft-foundry/resource/create/integration.test.ts`	Fixed ESM `__dirname` bug (6 tests were failing due to bare `__dirname` in ESM module)
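
The token tracking row above could be sketched roughly as follows. This is an assumption-laden illustration: the event type `assistant.usage` comes from the description, but the payload field names (`inputTokens`, `outputTokens`, `cacheReadTokens`) and the `SessionEvent` shape are hypothetical and may not match the SDK or agent-runner.ts.

```typescript
// Hypothetical event shape; the real SDK event types may differ.
interface SessionEvent {
  type: string;
  data: Record<string, unknown>;
}

interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
  cacheReadTokens: number;
  llmCalls: number;
}

// Treat anything that is not a finite number as zero so one malformed
// event does not poison the totals.
function toCount(value: unknown): number {
  return typeof value === "number" && Number.isFinite(value) ? value : 0;
}

// Sum token counts across all assistant.usage events; each such event
// is counted as one LLM call.
function accumulateUsage(events: SessionEvent[]): TokenUsage {
  const usage: TokenUsage = { inputTokens: 0, outputTokens: 0, cacheReadTokens: 0, llmCalls: 0 };
  for (const event of events) {
    if (event.type !== "assistant.usage") continue;
    usage.inputTokens += toCount(event.data["inputTokens"]);
    usage.outputTokens += toCount(event.data["outputTokens"]);
    usage.cacheReadTokens += toCount(event.data["cacheReadTokens"]);
    usage.llmCalls += 1;
  }
  return usage;
}
```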

Quality Metrics (current state from local test data)

81% pass rate (53/65 tests)
51-100% path adherence across 10 skill areas
8/12 areas fully green

How It Works

Integration tests run via test-all-integration.yml (nightly or manual dispatch)
agent-runner.ts captures token usage and writes token-usage.json per test
Post-test step runs generate-quality-report.js, which produces skill-quality-report.json
Artifact (integration-report-*) is uploaded with JUnit XML + quality report
Dashboard fetches latest artifact via GitHub API and renders the report

Testing

ESLint passes
TypeScript compiles (tsc --noEmit)
CodeQL: no alerts (TOCTOU race fixed, unused functions removed)
All existing tests unaffected (only additive changes + 1 ESM bugfix)

…eline step
- Add generate-quality-report.js: parses agent-metadata traces, computes path adherence scores, generates skill-quality-report.json
- Update agent-runner.ts: capture token usage, LLM call metadata, and tool call formatting for both skill: and tool: code blocks
- Add quality-report npm script to package.json
- Add 'Generate quality report' step to GitHub Actions workflow
- Define EXPECTED_PATHS for 10 skill areas based on real passing traces

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The test used bare __dirname which is not available in ESM modules. Added fileURLToPath(import.meta.url) polyfill matching the pattern used in jest.setup.ts and other test files. Fixes 6 test failures: Workflow Documentation, Command Validation, and References Pattern tests. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
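
For readers unfamiliar with the pattern named in this commit, a minimal sketch follows. The helper name `dirnameFromModuleUrl` is hypothetical; only `fileURLToPath` and `import.meta.url` are taken from the commit message, and the exact code in the test file may differ.

```typescript
import { dirname } from "node:path";
import { fileURLToPath } from "node:url";

// Convert a module's file: URL into the directory that contains it.
// This stands in for the CommonJS __dirname global, which ESM does not define.
function dirnameFromModuleUrl(moduleUrl: string): string {
  return dirname(fileURLToPath(moduleUrl));
}

// In an ESM test file this would typically be used as:
//   const __dirname = dirnameFromModuleUrl(import.meta.url);
```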

- Add eslint-disable for unused utility functions (loadPerTestTokenUsage, extractTestCase)
- Use const instead of let for match variable
- Use double quotes instead of single quotes
- Remove unused testRunPath parameter from buildAreaSummaries

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Replace existsSync+readFileSync pattern with try/catch readFileSync to eliminate time-of-check-time-of-use file system race condition flagged by CodeQL as high severity. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
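
The change described here can be sketched as below. The helper name `readTextIfPresent` and the choice to rethrow errors other than ENOENT are assumptions for the example; the commit only states that the existsSync check was replaced by a try/catch around readFileSync.

```typescript
import { readFileSync } from "node:fs";

// Read the file directly and handle "not found" in the catch block.
// Checking existsSync() first leaves a window in which the file can be
// removed or replaced between the check and the read.
function readTextIfPresent(filePath: string): string | null {
  try {
    return readFileSync(filePath, "utf8");
  } catch (error) {
    if ((error as NodeJS.ErrnoException).code === "ENOENT") {
      return null; // file is absent: report "no data" instead of failing
    }
    throw error; // assumption: other errors (permissions, I/O) should surface
  }
}
```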

- Remove loadPerTestTokenUsage and extractTestCase functions from generate-quality-report.js (CodeQL unused-function alerts)
- Remove totalCost and per-call cost fields from TokenUsage interface and all cost calculations in agent-runner.ts (not used in reports)
- Keep token count tracking (inputTokens, outputTokens, cache tokens)

Addresses all 3 CodeQL review comments from PR #1.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Adds an automated “skill quality” reporting layer on top of the existing integration test outputs so downstream systems (dashboard) can consume a single JSON contract with token usage, traces, and path-adherence metrics.

Changes:

Added token usage capture + per-test/token-summary JSON emission in the agent runner.
Added a new generate-quality-report.js script to consolidate JUnit + agent traces + token data into skill-quality-report.json.
Updated CI workflow and test package scripts to generate the quality report artifact after integration tests.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.

File	Description
`tests/utils/agent-runner.ts`	Captures token usage from session events and writes per-test and consolidated token usage JSON.
`tests/scripts/generate-quality-report.js`	New post-processor that builds the dashboard contract JSON (areas, token usage, traces, path adherence).
`tests/package.json`	Adds `quality-report` npm script.
`tests/microsoft-foundry/resource/create/integration.test.ts`	Attempts to fix ESM `__dirname` usage for this test.
`.github/workflows/test-all-integration.yml`	Adds a post-test step to generate the quality report in CI artifacts.

- Remove unused path/fileURLToPath imports from integration.test.ts (__dirname comes from jest.setup.ts global)
- Guard against -1 index in path adherence node lookup
- Wrap loadTokenSummary in try/catch for corrupted JSON resilience
- Switch token-summary from JSON to JSONL for safe concurrent writes
- Redact prompts in token-usage.json via redactSecrets()
- Initialize model default from modelOverride env var

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
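
A rough sketch of the JSONL approach and the tolerant loader mentioned in this commit follows. The `TokenSummaryEntry` fields and the helper names `appendTokenSummary` and `parseTokenSummary` are hypothetical, and this `loadTokenSummary` is an illustration rather than the function in the repository.

```typescript
import { appendFileSync, readFileSync } from "node:fs";

// Hypothetical record shape for one test's token totals.
interface TokenSummaryEntry {
  testName: string;
  inputTokens: number;
  outputTokens: number;
}

// JSONL: one complete JSON object per line. Appending a line avoids the
// read-modify-write cycle that a single shared JSON document would need.
function appendTokenSummary(filePath: string, entry: TokenSummaryEntry): void {
  appendFileSync(filePath, JSON.stringify(entry) + "\n", "utf8");
}

// Parse JSONL text, skipping blank or corrupted lines instead of failing.
function parseTokenSummary(text: string): TokenSummaryEntry[] {
  const entries: TokenSummaryEntry[] = [];
  for (const line of text.split("\n")) {
    if (line.trim() === "") continue;
    try {
      entries.push(JSON.parse(line) as TokenSummaryEntry);
    } catch {
      // A damaged line should not take down the whole report.
    }
  }
  return entries;
}

// Load the summary file; a missing or unreadable file yields an empty list.
function loadTokenSummary(filePath: string): TokenSummaryEntry[] {
  try {
    return parseTokenSummary(readFileSync(filePath, "utf8"));
  } catch {
    return [];
  }
}
```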

- Write agent-metadata.json for machine consumption (comment microsoft#7)
- Rename apiCall to llmCall consistently (comment microsoft#8)
- Quality report step only runs for microsoft-foundry (comment microsoft#9)
- Add comprehensive JSDoc to buildTraces function (comment microsoft#10)
- Fix passRate: simplify math, return null when no tests (comment microsoft#11)
- Remove duplicate isSkillInvoked/getToolCalls, re-export from evaluate.ts (comments microsoft#12, microsoft#13)
- All 6 Copilot reviewer comments already addressed in prior commits

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
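
The passRate fix could look something like the sketch below. The signature and the choice to return a fraction between 0 and 1 are assumptions; the commit only says the math was simplified and that null is returned when there are no tests, and it does not say how the value is rounded for display.

```typescript
// Return the fraction of passing tests, or null when no tests ran.
// Returning null avoids reporting 0/0 as either 0% or NaN.
function passRate(passed: number, total: number): number | null {
  if (total <= 0) return null;
  return passed / total;
}
```

Returning null lets a consumer distinguish "no tests ran" from "all tests failed", which a plain 0 would conflate.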

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

Comments suppressed due to low confidence (1)

tests/utils/agent-runner.ts:205

The comments inside these switch case blocks are not indented relative to the block opening brace. With the repo’s ESLint indent rule (2 spaces, SwitchCase: 1), this is likely to fail lint. Indent these comment lines to match the surrounding statements (same applies to the similar comment lines in the other case blocks nearby).

      case "assistant.message_delta": {
      // Accumulate deltas for streaming - we'll use the final message instead
        const messageId = event.data.messageId as string;
        const deltaContent = event.data.deltaContent as string;
        if (messageId && deltaContent) {
          messageDeltas[messageId] = (messageDeltas[messageId] || "") + deltaContent;
        }

…ix event types
- Remove extractToolCallsFromMarkdown() entirely; agent-metadata.json is the only source
- Remove markdown fallback in extractToolCalls(); return empty if no JSON
- Apply redactSecrets() to full JSON text instead of just the prompt field
- Fix event type matching: use SDK's tool.execution_start with data.toolName
- Fix testedAreas to use a.name instead of a.area (property name bug)
- Update JSDoc to remove stale markdown fallback reference

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

banibrata-de · 2026-03-03T21:23:50Z

Addressed all review comments from @JasonYeMSFT in the latest push (15558d9):

Removed markdown fallback entirely: extractToolCallsFromMarkdown() and all regex parsing are deleted. The report now reads only agent-metadata.json.
redactSecrets on full JSON: changed to redactSecrets(JSON.stringify(jsonData, null, 2)) so the entire JSON text is redacted, not just the prompt field.
Fixed event type matching: updated to use the SDK's actual tool.execution_start event with data.toolName instead of assistant.tool_call.
Fixed a.area to a.name: bug fix in coverage; area objects use the name property, not area.
Thread safety: acknowledged; each test writes to its own directory, so there is no file-level contention.
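
To make the event-matching fix concrete, here is a minimal sketch of pulling the tool sequence out of recorded events. The `AgentEvent` shape and the function name `toolNamesFromEvents` are hypothetical; only the event type `tool.execution_start` and the `data.toolName` field come from the comment above, and the script's real extractToolCalls() may be structured differently.

```typescript
// Hypothetical shape of one entry in agent-metadata.json.
interface AgentEvent {
  type: string;
  data?: { toolName?: unknown };
}

// Collect tool names in call order from tool.execution_start events.
// Other event types (including assistant.tool_call) are ignored, and
// events without a usable string name are skipped.
function toolNamesFromEvents(events: AgentEvent[]): string[] {
  const tools: string[] = [];
  for (const event of events) {
    if (event.type !== "tool.execution_start") continue;
    const name = event.data?.toolName;
    if (typeof name === "string" && name.length > 0) {
      tools.push(name);
    }
  }
  return tools;
}
```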

Net result: -77 lines (29 added, 106 removed).

JasonYeMSFT · 2026-03-03T22:37:26Z

@banibrata-de You also need to agree to the Contributor License Agreement.

banibrata-de · 2026-03-03T23:36:50Z

@microsoft-github-policy-service agree company="Microsoft"

banibrata-de and others added 5 commits March 2, 2026 16:09

Copilot AI review requested due to automatic review settings March 3, 2026 00:10

Copilot started reviewing on behalf of banibrata-de March 3, 2026 00:11

Copilot AI reviewed Mar 3, 2026

erjohnms approved these changes Mar 3, 2026
