Fix Action Completion metric broken by platform change #13
conorbronsdon wants to merge 2 commits into rungalileo:main
Conversation
The `agentic_session_success` metric was moved from experiment-level to trace-level only in a Galileo platform update, breaking the Action Completion scoring for new evaluation runs. Update config.py to use the custom "Action Completion - Agent Leaderboard" metric (duplicated at the trace level) as recommended in rungalileo#12. Also make fetch_results.py resilient to the metric key name change by trying multiple possible key formats, with a graceful fallback. Closes rungalileo#12 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
```python
# Look up the action completion metric by key.
# The key changed from 'average_agentic_session_success' to a
# custom metric name after a Galileo platform update (see #12).
action_completion = None
props = exp.aggregate_metrics.additional_properties
for key in ("average_action_completion___agent_leaderboard",
            "average_Action Completion - Agent Leaderboard",
            "average_agentic_session_success"):
    if key in props:
        action_completion = round(props[key], 2)
        break
if action_completion is None:
    print(f"Action completion metric not found in experiment {exp.name}, proceeding without it")
```
The missing metric is reported only via `print(f"Action completion metric not found in experiment {exp.name}, proceeding without it")`, so the message bypasses our structured logging (`LoggingConfig.get_logger()`) and won't be captured or monitored. Can we replace this print (and any user-visible diagnostics in this block) with the configured logger, e.g. `logger.warning(...)`?
Finding type: Awesome Galileo Reviewer | Severity: 🟢 Low
Prompt for AI Agents:
In v2/results/fetch_results.py around lines 217 to 229, the block inside
process_experiment uses print(...) to log the missing action completion metric. Replace
print(f"Action completion metric not found in experiment {exp.name}, proceeding without
it") with a structured logger call (e.g., logger.warning(...)). If a module-level logger
variable is not available, obtain one via LoggingConfig.get_logger() (or the project’s
configured logger) at the top of the file and reuse it. Also ensure any other
user-visible diagnostics in this block use logger.warning or logger.info instead of
print so the absence of metrics is recorded in centralized logs.
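For illustration, a minimal sketch of what the suggested change might look like, assuming `LoggingConfig.get_logger()` returns a standard `logging.Logger`; the import path below is a guess, not the project's actual module, so the sketch falls back to the stdlib logger:

```python
import logging

# Hypothetical import path for the project's LoggingConfig; if it isn't
# importable, fall back to the stdlib logger so this sketch runs standalone.
try:
    from logging_config import LoggingConfig  # assumed module path
    logger = LoggingConfig.get_logger(__name__)
except ImportError:
    logger = logging.getLogger(__name__)

def report_missing_metric(experiment_name: str) -> None:
    # Unlike print(), logger.warning() flows through the configured handlers,
    # so the missing-metric event is recorded in centralized logs.
    logger.warning(
        "Action completion metric not found in experiment %s, proceeding without it",
        experiment_name,
    )

logging.basicConfig(level=logging.INFO)
report_missing_metric("demo-experiment")
```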
Skipping this one — the entire codebase uses print() for diagnostics, so switching just this spot to a logger would be inconsistent. Can be a separate cleanup.
Understood—keeping it consistent with the rest of the codebase makes sense for now, and deferring logger cleanup to a separate pass is reasonable.
The --metrics defaults in run_experiment.py and run_parallel_experiments.py were hardcoded to the old metric name, so the new Action Completion metric was never requested unless --metrics was explicitly passed. Import the defaults from config.METRICS so the runners stay in sync automatically. Also update simple_test.py to use the new metric name. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
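A rough sketch of the shared-default pattern this commit describes; the argparse wiring and the inline `METRICS` list are illustrative stand-ins for the real runner code and `config.py` contents (whether `tool_selection_quality` is also listed there is an assumption):

```python
import argparse

# Stand-in for `from config import METRICS`; the real list lives in config.py.
METRICS = ["Action Completion - Agent Leaderboard", "tool_selection_quality"]

parser = argparse.ArgumentParser()
# Defaulting to the shared METRICS list means a renamed metric propagates to
# every runner automatically instead of going stale in a hardcoded default.
parser.add_argument("--metrics", nargs="+", default=METRICS)

args = parser.parse_args([])  # no --metrics passed: shared defaults apply
print(args.metrics)
```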
@bhavsarpratik could you take a quick look at this? Should be an easy fix for an ongoing issue.
No activity for 30 days — this PR will be closed in 5 days unless updated.
User description
Summary
- Update `config.py` to use the custom `"Action Completion - Agent Leaderboard"` metric name, as recommended by @bhavsarpratik in Missing metric: Action Completion #12
- Make `fetch_results.py` resilient to the metric key change by trying multiple possible aggregate key formats with a graceful fallback (mirrors the existing pattern for `tool_selection_quality`)

Context
The `agentic_session_success` metric was moved from experiment-level to trace-level only in a Galileo platform update. This broke Action Completion scoring for all new evaluation runs. The fix follows the workaround documented in #12: duplicate the metric at the trace level and reference it by its new custom name.

Note: The exact aggregate property key returned by the Galileo SDK for the new custom metric name hasn't been verified end-to-end. The code tries three plausible key formats. Once confirmed with a live run, the fallback keys can be trimmed.
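For reference, the fallback behavior can be sanity-checked offline with a fake properties dict; the helper below is illustrative, while the real code reads `exp.aggregate_metrics.additional_properties`:

```python
# Standalone check of the key-fallback logic from fetch_results.py.
CANDIDATE_KEYS = (
    "average_action_completion___agent_leaderboard",
    "average_Action Completion - Agent Leaderboard",
    "average_agentic_session_success",
)

def lookup_action_completion(props: dict) -> float | None:
    # Try each candidate key in priority order; first match wins.
    for key in CANDIDATE_KEYS:
        if key in props:
            return round(props[key], 2)
    return None

# New-style key is picked up even when the legacy key is absent.
assert lookup_action_completion(
    {"average_action_completion___agent_leaderboard": 0.873}
) == 0.87
# Legacy key still works as a fallback; a missing metric yields None.
assert lookup_action_completion({"average_agentic_session_success": 0.5}) == 0.5
assert lookup_action_completion({}) is None
print("fallback lookup behaves as expected")
```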
Closes #12
Test plan
- Verify `fetch_results.py` correctly picks up the new metric key from aggregate properties

🤖 Generated with Claude Code
Generated description
Below is a concise technical summary of the changes proposed in this PR:
Update the evaluation config and experiment runners to request the custom Action Completion leaderboard metric via `METRICS`, ensuring simulated experiments run with the new name. Handle the metric key change in `fetch_results.py` by checking multiple aggregate property formats so the Action Completion score continues to be reported.

- Handle the metric key change in `fetch_results.py` by trying multiple aggregate property names before reporting the metric, and log when it is absent.

Modified files (1)

Latest Contributors (1)
- The experiment runners take their `--metrics` defaults from `METRICS`, which now references the custom Action Completion leaderboard metric so all runs use the updated name.

Modified files (4)

Latest Contributors (1)