Skill quality dashboard: path adherence scoring, token tracking, and CI pipeline #1102

Merged
kvenkatrajan merged 8 commits into microsoft:main from banibrata-de:skill-quality-dashboard
Mar 3, 2026

@banibrata-de
Contributor

Skill Quality Dashboard & Reporting Pipeline

What

Adds automated quality reporting infrastructure for Copilot skill integration tests: a skill-quality-report.json artifact and the generate-quality-report.js script that produces it. This enables the Copilot Skills Quality Dashboard to visualize skill health, path adherence, and token usage from CI artifacts.

Why

Today, integration tests only assert pass/fail. A test can pass while the agent takes a suboptimal path (e.g., answering from cached docs instead of calling the intended MCP tools). This creates blind spots in skill quality that are invisible until end-users report issues.

Path adherence scoring compares actual agent tool-call sequences against expected workflows derived from passing traces, catching:

  • Skills that answer from documentation instead of querying live data
  • Routing loops where the agent calls view repeatedly without progressing
  • Missing MCP tool invocations (e.g., skipping azure-quota in a quota skill)
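
As an illustration of the comparison described above, here is a minimal sketch that scores a run by how many expected steps appear, in order, in the actual tool-call sequence. This is an assumption about one reasonable scoring rule, not the logic in generate-quality-report.js; the function name, the result shape, and the `skill:`/`tool:` step labels are hypothetical.

```typescript
// Hypothetical sketch: the real scoring in generate-quality-report.js may differ.
interface AdherenceResult {
  score: number;     // fraction of expected steps matched in order, from 0 to 1
  missing: string[]; // expected steps that were not found in order
}

function scorePathAdherence(expected: string[], actual: string[]): AdherenceResult {
  if (expected.length === 0) {
    return { score: 1, missing: [] }; // nothing expected, so nothing can be missed
  }
  const missing: string[] = [];
  let cursor = 0;  // index in `actual` from which the next expected step is searched
  let matched = 0;
  for (const step of expected) {
    const index = actual.indexOf(step, cursor);
    if (index === -1) {
      // Guard against the -1 result: record the gap and leave the cursor alone.
      missing.push(step);
      continue;
    }
    matched += 1;
    cursor = index + 1;
  }
  return { score: matched / expected.length, missing };
}

// Hypothetical expected path for a quota skill.
const expectedPath = ["skill:azure-quota", "tool:azure-quota"];
// A run that loops on `view` and never reaches the MCP tool.
const loopingRun = ["skill:azure-quota", "view", "view", "view"];

const result = scorePathAdherence(expectedPath, loopingRun);
console.log(result.score, result.missing);
```

A test like this can still pass its assertions while the score drops, which is the blind spot the report is meant to expose.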

Live Dashboard

https://foundry-e2etest.azurewebsites.net/api/copilot-skills

The dashboard renders from workflow artifacts automatically. It shows per-area and per-test metrics:

  • Pass/fail status with reliability & consistency scores
  • Path adherence (expected vs actual tool sequences)
  • Token usage (input/output per test)
  • LLM call count
  • Duration
  • Interactive trace viewer with Mermaid flow diagrams

Changes

| File | What |
| --- | --- |
| tests/scripts/generate-quality-report.js | New. Parses JUnit XML + agent-metadata markdown to produce skill-quality-report.json with traces, path adherence, and recommendations |
| tests/utils/agent-runner.ts | Token usage tracking (input/output/cache tokens, API call count, duration) extracted from assistant.usage and session.shutdown events |
| tests/package.json | Added quality-report script |
| .github/workflows/test-all-integration.yml | Added "Generate quality report" step that runs after tests and uploads the results in the artifact |
| tests/microsoft-foundry/resource/create/integration.test.ts | Fixed ESM __dirname bug (6 tests were failing due to bare __dirname in ESM module) |

Quality Metrics (current state from local test data)

  • 81% pass rate (53/65 tests)
  • 51–100% path adherence across 10 skill areas
  • 8/12 areas fully green

How It Works

  1. Integration tests run via test-all-integration.yml (nightly or manual dispatch)
  2. agent-runner.ts captures token usage and writes token-usage.json per test
  3. Post-test step runs generate-quality-report.js, which produces skill-quality-report.json
  4. Artifact (integration-report-*) is uploaded with JUnit XML + quality report
  5. Dashboard fetches latest artifact via GitHub API and renders the report
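
Step 2 can be pictured with a minimal sketch of folding usage events into per-test totals. The `assistant.usage` event name and the inputTokens/outputTokens fields come from this PR; the `type`/`data` event shape, the `llmCalls` counter, and the helper names are assumptions made for illustration, and cache tokens and duration are left out.

```typescript
// Assumed event shape: a type string plus a loosely typed data payload.
interface SessionEvent {
  type: string;
  data: Record<string, unknown>;
}

interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
  llmCalls: number; // hypothetical counter: one per usage event
}

// Treat anything that is not a finite number as zero so a malformed
// event cannot turn the totals into NaN.
function toCount(value: unknown): number {
  return typeof value === "number" && Number.isFinite(value) ? value : 0;
}

function accumulateTokenUsage(events: SessionEvent[]): TokenUsage {
  const usage: TokenUsage = { inputTokens: 0, outputTokens: 0, llmCalls: 0 };
  for (const event of events) {
    if (event.type !== "assistant.usage") continue; // ignore unrelated events
    usage.inputTokens += toCount(event.data.inputTokens);
    usage.outputTokens += toCount(event.data.outputTokens);
    usage.llmCalls += 1;
  }
  return usage;
}

// Made-up sample events for a single test run.
const sampleEvents: SessionEvent[] = [
  { type: "assistant.usage", data: { inputTokens: 1200, outputTokens: 150 } },
  { type: "assistant.message_delta", data: { messageId: "m1", deltaContent: "hi" } },
  { type: "assistant.usage", data: { inputTokens: 900, outputTokens: 80 } },
];

const totals = accumulateTokenUsage(sampleEvents);
console.log(JSON.stringify(totals));
```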

Testing

  • ESLint passes
  • TypeScript compiles (tsc --noEmit)
  • CodeQL: no alerts (TOCTOU race fixed, unused functions removed)
  • All existing tests unaffected (only additive changes + 1 ESM bugfix)
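
The TOCTOU fix listed above replaces an existsSync + readFileSync pair with a single guarded read. A minimal sketch of that pattern follows; the function name is hypothetical, returning null for a missing file is an assumption, and the reader is injected so the sketch needs no files on disk. In practice the reader would be a thin wrapper around fs.readFileSync(path, "utf8").

```typescript
// Reader is injected for illustration; a real caller would wrap fs.readFileSync.
type ReadText = (path: string) => string;

function readJsonOrNull(path: string, readText: ReadText): unknown {
  let text: string;
  try {
    // One read attempt and no separate existence check, so the file cannot
    // disappear between a "check" and a "use".
    text = readText(path);
  } catch (err) {
    if ((err as { code?: string }).code === "ENOENT") {
      return null; // a missing file is treated as an expected case (assumption)
    }
    throw err; // permission errors and the like still surface
  }
  return JSON.parse(text);
}

// Fake readers standing in for the file system.
const missingFile: ReadText = () => {
  throw Object.assign(new Error("no such file"), { code: "ENOENT" });
};
const presentFile: ReadText = () => '{"passed": 53}';

console.log(readJsonOrNull("token-usage.json", missingFile));
console.log(readJsonOrNull("token-usage.json", presentFile));
```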

banibrata-de and others added 5 commits March 2, 2026 16:09
…eline step

- Add generate-quality-report.js: parses agent-metadata traces, computes
  path adherence scores, generates skill-quality-report.json
- Update agent-runner.ts: capture token usage, LLM call metadata, and
  tool call formatting for both skill: and tool: code blocks
- Add quality-report npm script to package.json
- Add 'Generate quality report' step to GitHub Actions workflow
- Define EXPECTED_PATHS for 10 skill areas based on real passing traces

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The test used bare __dirname which is not available in ESM modules.
Added fileURLToPath(import.meta.url) polyfill matching the pattern used
in jest.setup.ts and other test files.

Fixes 6 test failures: Workflow Documentation, Command Validation,
and References Pattern tests.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add eslint-disable for unused utility functions (loadPerTestTokenUsage, extractTestCase)
- Use const instead of let for match variable
- Use double quotes instead of single quotes
- Remove unused testRunPath parameter from buildAreaSummaries

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace existsSync+readFileSync pattern with try/catch readFileSync
to eliminate time-of-check-time-of-use file system race condition
flagged by CodeQL as high severity.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Remove loadPerTestTokenUsage and extractTestCase functions from
  generate-quality-report.js (CodeQL unused-function alerts)
- Remove totalCost and per-call cost fields from TokenUsage interface
  and all cost calculations in agent-runner.ts (not used in reports)
- Keep token count tracking (inputTokens, outputTokens, cache tokens)

Addresses all 3 CodeQL review comments from PR #1.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 3, 2026 00:10
Contributor

Copilot AI left a comment

Pull request overview

Adds an automated “skill quality” reporting layer on top of the existing integration test outputs so downstream systems (dashboard) can consume a single JSON contract with token usage, traces, and path-adherence metrics.

Changes:

  • Added token usage capture + per-test/token-summary JSON emission in the agent runner.
  • Added a new generate-quality-report.js script to consolidate JUnit + agent traces + token data into skill-quality-report.json.
  • Updated CI workflow and test package scripts to generate the quality report artifact after integration tests.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| tests/utils/agent-runner.ts | Captures token usage from session events and writes per-test and consolidated token usage JSON. |
| tests/scripts/generate-quality-report.js | New post-processor that builds the dashboard contract JSON (areas, token usage, traces, path adherence). |
| tests/package.json | Adds quality-report npm script. |
| tests/microsoft-foundry/resource/create/integration.test.ts | Attempts to fix ESM __dirname usage for this test. |
| .github/workflows/test-all-integration.yml | Adds a post-test step to generate the quality report in CI artifacts. |

- Remove unused path/fileURLToPath imports from integration.test.ts
  (__dirname comes from jest.setup.ts global)
- Guard against -1 index in path adherence node lookup
- Wrap loadTokenSummary in try/catch for corrupted JSON resilience
- Switch token-summary from JSON to JSONL for safe concurrent writes
- Redact prompts in token-usage.json via redactSecrets()
- Initialize model default from modelOverride env var

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
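
The JSONL switch and the corrupted-JSON guard in this commit can be sketched together. With one record per line, a writer only appends and a reader can skip a damaged line instead of losing the whole summary. The record fields and function names below are hypothetical, and the actual loadTokenSummary may behave differently.

```typescript
// Hypothetical record shape for one line of the token summary.
interface TokenRecord {
  test: string;
  inputTokens: number;
  outputTokens: number;
}

// One record becomes exactly one newline-terminated line.
function toJsonlLine(record: TokenRecord): string {
  return JSON.stringify(record) + "\n";
}

// Parse line by line; a corrupted or truncated line is counted and skipped
// rather than aborting the whole load.
function parseJsonl(text: string): { records: TokenRecord[]; skipped: number } {
  const records: TokenRecord[] = [];
  let skipped = 0;
  for (const line of text.split("\n")) {
    if (line.trim() === "") continue; // blank or trailing line
    try {
      records.push(JSON.parse(line) as TokenRecord);
    } catch {
      skipped += 1;
    }
  }
  return { records, skipped };
}

// Two good records with a truncated line between them.
const summaryText =
  toJsonlLine({ test: "quota-basic", inputTokens: 1200, outputTokens: 150 }) +
  '{"test": "trunc\n' +
  toJsonlLine({ test: "quota-region", inputTokens: 900, outputTokens: 80 });

const loaded = parseJsonl(summaryText);
console.log(loaded.records.length, loaded.skipped);
```
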
- Write agent-metadata.json for machine consumption (comment microsoft#7)
- Rename apiCall to llmCall consistently (comment microsoft#8)
- Quality report step only runs for microsoft-foundry (comment microsoft#9)
- Add comprehensive JSDoc to buildTraces function (comment microsoft#10)
- Fix passRate: simplify math, return null when no tests (comment microsoft#11)
- Remove duplicate isSkillInvoked/getToolCalls, re-export from evaluate.ts (comments microsoft#12, microsoft#13)
- All 6 Copilot reviewer comments already addressed in prior commits

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
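
The passRate change in this commit (return null when there are no tests) can be sketched as below. The signature is an assumption; the point is that an empty area is reported as "no data" rather than as 0% or NaN.

```typescript
// Hypothetical signature: the report's own passRate may take different inputs.
function passRate(passed: number, total: number): number | null {
  if (total <= 0) {
    return null; // no tests ran, so there is no meaningful rate
  }
  return passed / total;
}

console.log(passRate(53, 65)); // ratio for the figures quoted in the PR description
console.log(passRate(0, 0));   // null
```
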
Copilot AI review requested due to automatic review settings March 3, 2026 05:03
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

Comments suppressed due to low confidence (1)

tests/utils/agent-runner.ts:205

  • The comments inside these switch case blocks are not indented relative to the block opening brace. With the repo’s ESLint indent rule (2 spaces, SwitchCase: 1), this is likely to fail lint. Indent these comment lines to match the surrounding statements (same applies to the similar comment lines in the other case blocks nearby).
      case "assistant.message_delta": {
      // Accumulate deltas for streaming - we'll use the final message instead
        const messageId = event.data.messageId as string;
        const deltaContent = event.data.deltaContent as string;
        if (messageId && deltaContent) {
          messageDeltas[messageId] = (messageDeltas[messageId] || "") + deltaContent;
        }

…ix event types

- Remove extractToolCallsFromMarkdown() entirely; agent-metadata.json is the only source
- Remove markdown fallback in extractToolCalls(); return empty if no JSON
- Apply redactSecrets() to full JSON text instead of just the prompt field
- Fix event type matching: use SDK's tool.execution_start with data.toolName
- Fix testedAreas to use a.name instead of a.area (property name bug)
- Update JSDoc to remove stale markdown fallback reference

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@banibrata-de
Contributor Author

Addressed all review comments from @JasonYeMSFT in the latest push (15558d9):

  1. Removed markdown fallback entirely: extractToolCallsFromMarkdown() and all regex parsing are deleted. The report now reads only agent-metadata.json.
  2. redactSecrets on full JSON: changed to redactSecrets(JSON.stringify(jsonData, null, 2)) so the entire JSON text is redacted, not just the prompt field.
  3. Fixed event type matching: updated to use the SDK's actual tool.execution_start event with data.toolName instead of assistant.tool_call.
  4. Fixed a.area to a.name: bug fix in coverage; area objects use the name property, not area.
  5. Thread safety: acknowledged; each test writes to its own directory, so there is no file-level contention.
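
Item 2 can be illustrated with a small sketch of redacting the serialized JSON rather than a single field. The redactSecrets below is a hypothetical stand-in with one made-up pattern, since the repository's real helper is not shown in this PR, and the sample data is invented.

```typescript
// Hypothetical stand-in for the repo's redactSecrets(); real patterns are unknown.
function redactSecrets(text: string): string {
  return text.replace(/Bearer\s+[A-Za-z0-9._-]+/g, "Bearer [REDACTED]");
}

// Invented metadata where the secret sits outside the prompt field.
const jsonData = {
  prompt: "List my quotas",
  toolCalls: [{ toolName: "azure-quota", args: { header: "Bearer abc123.def" } }],
};

// Redacting only jsonData.prompt would leave the token in toolCalls untouched.
// Redacting the whole serialized text covers every field at once.
const redactedText = redactSecrets(JSON.stringify(jsonData, null, 2));
console.log(redactedText);
```

One trade-off of this approach is that the replacement text must itself be safe inside a JSON string (no quotes or backslashes), otherwise the output would no longer parse.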

Net result: -77 lines (29 added, 106 removed).

@JasonYeMSFT
Member

@banibrata-de You also need to agree to the Contributor License Agreement.

@banibrata-de
Contributor Author

@microsoft-github-policy-service agree company="Microsoft"

@kvenkatrajan kvenkatrajan merged commit 2b331d3 into microsoft:main Mar 3, 2026
10 checks passed

5 participants