
Fix Action Completion metric broken by platform change #13

Open

conorbronsdon wants to merge 2 commits into rungalileo:main from conorbronsdon:fix/action-completion-metric

Conversation

@conorbronsdon (Contributor) commented on Mar 13, 2026

User description

Summary

  • Updates config.py to use the custom "Action Completion - Agent Leaderboard" metric name, as recommended by @bhavsarpratik in Missing metric: Action Completion #12
  • Makes fetch_results.py resilient to the metric key change by trying multiple possible aggregate key formats with a graceful fallback (mirrors the existing pattern for tool_selection_quality)

Context

The agentic_session_success metric was moved from experiment-level to trace-level only in a Galileo platform update. This broke Action Completion scoring for all new evaluation runs. The fix follows the workaround documented in #12: duplicate the metric at the trace level and reference it by its new custom name.

Note: The exact aggregate property key returned by the Galileo SDK for the new custom metric name hasn't been verified end-to-end. The code tries three plausible key formats. Once confirmed with a live run, the fallback keys can be trimmed.
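
For illustration, a minimal sketch of what the config.py side of this could look like. Only the metric string "Action Completion - Agent Leaderboard", the existence of a METRICS list, and the tool_selection_quality metric are taken from this PR; the constant name and list structure are hypothetical:

# v2/evaluate/config.py (illustrative sketch, not the actual file contents)

# Custom trace-level duplicate of agentic_session_success, created per the
# workaround documented in issue #12.
ACTION_COMPLETION_METRIC = "Action Completion - Agent Leaderboard"

# Metrics requested for every experiment run; other entries are placeholders.
METRICS = [
    ACTION_COMPLETION_METRIC,
    "tool_selection_quality",
]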

Closes #12

Test plan

  • Run a new experiment with the updated config and verify Action Completion scores appear
  • Verify fetch_results.py correctly picks up the new metric key from aggregate properties

🤖 Generated with Claude Code


Generated description

Below is a concise technical summary of the changes proposed in this PR:
Update the evaluation config and experiment runners to request the custom Action Completion leaderboard metric via METRICS, ensuring simulated experiments run with the new name. Handle the metric key change in fetch_results.py by checking multiple aggregate property formats so the Action Completion score continues to be reported.

Topic: Fetch Resilience
Details: Handle the action completion key change in fetch_results.py by trying multiple aggregate property names before reporting the metric, and log when it is absent.
Modified files (1)
  • v2/results/fetch_results.py
Latest contributor: pratik@galileo.ai (commit: added-more-models-and-..., August 13, 2025)

Topic: Experiment Metrics
Details: Align experiment runners and config to request METRICS, which now references the custom Action Completion leaderboard metric so all runs use the updated name.
Modified files (4)
  • v2/evaluate/config.py
  • v2/evaluate/run_experiment.py
  • v2/evaluate/run_parallel_experiments.py
  • v2/evaluate/simple_test.py
Latest contributor: pratik@galileo.ai (commit: prompt-changes-error-h..., June 30, 2025)
This pull request is reviewed by Baz.

The `agentic_session_success` metric was moved from experiment-level to
trace-level only in a Galileo platform update, breaking the Action
Completion scoring for new evaluation runs.

Update config.py to use the custom "Action Completion - Agent Leaderboard"
metric (duplicated at the trace level) as recommended in rungalileo#12. Also make
fetch_results.py resilient to the metric key name change by trying
multiple possible key formats, with a graceful fallback.

Closes rungalileo#12

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@conorbronsdon conorbronsdon marked this pull request as ready for review March 13, 2026 06:14
Comment on v2/results/fetch_results.py, lines +217 to +228
# Look up the action completion metric by key.
# The key changed from 'average_agentic_session_success' to a
# custom metric name after a Galileo platform update (see #12).
action_completion = None
props = exp.aggregate_metrics.additional_properties
for key in ("average_action_completion___agent_leaderboard",
            "average_Action Completion - Agent Leaderboard",
            "average_agentic_session_success"):
    if key in props:
        action_completion = round(props[key], 2)
        break
if action_completion is None:

Missing metric is reported only via print(f"Action completion metric not found in experiment {exp.name}, proceeding without it"), so the message bypasses our structured logging (LoggingConfig.get_logger()) and won't be captured or monitored. Can we replace this print (and any user-visible diagnostics in this block) with the configured logger, e.g. logger.warning(...)?

Finding type: Awesome Galileo Reviewer | Severity: 🟢 Low


Prompt for AI Agents:

In v2/results/fetch_results.py around lines 217 to 229, the block inside
process_experiment uses print(...) to log the missing action completion metric. Replace
print(f"Action completion metric not found in experiment {exp.name}, proceeding without
it") with a structured logger call (e.g., logger.warning(...)). If a module-level logger
variable is not available, obtain one via LoggingConfig.get_logger() (or the project’s
configured logger) at the top of the file and reuse it. Also ensure any other
user-visible diagnostics in this block use logger.warning or logger.info instead of
print so the absence of metrics is recorded in centralized logs.
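
For reference, a minimal sketch of the change the comment asks for. The LoggingConfig.get_logger() call is taken from the review comment above; the import path is an assumption and may differ in the repo:

# Import path is a guess; use wherever LoggingConfig actually lives in the project.
from logging_config import LoggingConfig

logger = LoggingConfig.get_logger()  # obtain the configured logger once, at module level

# ...inside process_experiment, replacing the print() call:
if action_completion is None:
    logger.warning(
        "Action completion metric not found in experiment %s, proceeding without it",
        exp.name,
    )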

@conorbronsdon (Contributor Author)

Skipping this one — the entire codebase uses print() for diagnostics, so switching just this spot to a logger would be inconsistent. Can be a separate cleanup.

Understood—keeping it consistent with the rest of the codebase makes sense for now, and deferring logger cleanup to a separate pass is reasonable.

Comment thread v2/evaluate/config.py
The --metrics defaults in run_experiment.py and run_parallel_experiments.py
were hardcoded to the old metric name, so the new Action Completion metric
was never requested unless --metrics was explicitly passed. Import from
config.METRICS so the runners stay in sync automatically.

Also updates simple_test.py to use the new metric name.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
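
A minimal sketch of the runner-side change described above; the argparse wiring and import path are assumptions, and only the --metrics flag and the config.METRICS constant come from the commit message:

# v2/evaluate/run_experiment.py (illustrative sketch)
import argparse

from config import METRICS  # single source of truth, instead of a hardcoded list

parser = argparse.ArgumentParser()
parser.add_argument(
    "--metrics",
    nargs="+",
    default=METRICS,  # previously defaulted to the old metric name
    help="Galileo metrics to request for the run",
)
args = parser.parse_args()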
@conorbronsdon (Contributor Author)

@bhavsarpratik could you take a quick look at this? It should be an easy fix for the ongoing issue.

@galileo-automation

No activity for 30 days — this PR will be closed in 5 days unless updated.

1 similar comment

Development

Successfully merging this pull request may close these issues: Missing metric: Action Completion (#12)

2 participants