Fix Action Completion metric broken by platform change #13
conorbronsdon wants to merge 2 commits into rungalileo:main
Conversation
The `agentic_session_success` metric was moved from experiment-level to trace-level only in a Galileo platform update, breaking the Action Completion scoring for new evaluation runs. Update config.py to use the custom "Action Completion - Agent Leaderboard" metric (duplicated at the trace level) as recommended in rungalileo#12. Also make fetch_results.py resilient to the metric key name change by trying multiple possible key formats, with a graceful fallback. Closes rungalileo#12 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
```python
# Look up the action completion metric by key.
# The key changed from 'average_agentic_session_success' to a
# custom metric name after a Galileo platform update (see #12).
action_completion = None
props = exp.aggregate_metrics.additional_properties
for key in ("average_action_completion___agent_leaderboard",
            "average_Action Completion - Agent Leaderboard",
            "average_agentic_session_success"):
    if key in props:
        action_completion = round(props[key], 2)
        break
if action_completion is None:
    print(f"Action completion metric not found in experiment {exp.name}, proceeding without it")
```
The missing metric is reported only via `print(f"Action completion metric not found in experiment {exp.name}, proceeding without it")`, so the message bypasses our structured logging (`LoggingConfig.get_logger()`) and won't be captured or monitored. Can we replace this print (and any user-visible diagnostics in this block) with the configured logger, e.g. `logger.warning(...)`?
Finding type: Awesome Galileo Reviewer | Severity: 🟢 Low
Prompt for AI Agents:
In v2/results/fetch_results.py around lines 217 to 229, the block inside
process_experiment uses print(...) to log the missing action completion metric. Replace
print(f"Action completion metric not found in experiment {exp.name}, proceeding without
it") with a structured logger call (e.g., logger.warning(...)). If a module-level logger
variable is not available, obtain one via LoggingConfig.get_logger() (or the project’s
configured logger) at the top of the file and reuse it. Also ensure any other
user-visible diagnostics in this block use logger.warning or logger.info instead of
print so the absence of metrics is recorded in centralized logs.
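For illustration, a minimal sketch of what the suggested change might look like, assuming `LoggingConfig.get_logger()` returns a standard `logging.Logger`; the import path below is a guess, not the project's actual module, so the sketch falls back to the stdlib logger:

```python
import logging

# Hypothetical import path for the project's LoggingConfig; if it isn't
# importable, fall back to the stdlib logger so this sketch runs standalone.
try:
    from logging_config import LoggingConfig  # assumed module path
    logger = LoggingConfig.get_logger(__name__)
except ImportError:
    logger = logging.getLogger(__name__)

def report_missing_metric(experiment_name: str) -> None:
    # Unlike print(), logger.warning() flows through the configured handlers,
    # so the missing-metric event is recorded in centralized logs.
    logger.warning(
        "Action completion metric not found in experiment %s, proceeding without it",
        experiment_name,
    )

logging.basicConfig(level=logging.INFO)
report_missing_metric("demo-experiment")
```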
Skipping this one — the entire codebase uses print() for diagnostics, so switching just this spot to a logger would be inconsistent. Can be a separate cleanup.
Understood—keeping it consistent with the rest of the codebase makes sense for now, and deferring logger cleanup to a separate pass is reasonable.
The --metrics defaults in run_experiment.py and run_parallel_experiments.py were hardcoded to the old metric name, so the new Action Completion metric was never requested unless --metrics was explicitly passed. Import the defaults from config.METRICS so the runners stay in sync automatically. Also update simple_test.py to use the new metric name. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
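A rough sketch of the shared-default pattern this commit describes; the argparse wiring and the inline `METRICS` list are illustrative stand-ins for the real runner code and `config.py` contents (whether `tool_selection_quality` is also listed there is an assumption):

```python
import argparse

# Stand-in for `from config import METRICS`; the real list lives in config.py.
METRICS = ["Action Completion - Agent Leaderboard", "tool_selection_quality"]

parser = argparse.ArgumentParser()
# Defaulting to the shared METRICS list means a renamed metric propagates to
# every runner automatically instead of going stale in a hardcoded default.
parser.add_argument("--metrics", nargs="+", default=METRICS)

args = parser.parse_args([])  # no --metrics passed: shared defaults apply
print(args.metrics)
```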
@bhavsarpratik could you take a quick look at this? Should be an easy fix for an ongoing issue.
No activity for 30 days — this PR will be closed in 5 days unless updated.
User description
Summary
- Update `config.py` to use the custom `"Action Completion - Agent Leaderboard"` metric name, as recommended by @bhavsarpratik in Missing metric: Action Completion #12
- Make `fetch_results.py` resilient to the metric key change by trying multiple possible aggregate key formats with a graceful fallback (mirrors the existing pattern for `tool_selection_quality`)

Context
The `agentic_session_success` metric was moved from experiment-level to trace-level only in a Galileo platform update. This broke Action Completion scoring for all new evaluation runs. The fix follows the workaround documented in #12: duplicate the metric at the trace level and reference it by its new custom name.

Note: The exact aggregate property key returned by the Galileo SDK for the new custom metric name hasn't been verified end-to-end. The code tries three plausible key formats. Once confirmed with a live run, the fallback keys can be trimmed.
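For reference, the fallback behavior can be sanity-checked offline with a fake properties dict; the helper below is illustrative, while the real code reads `exp.aggregate_metrics.additional_properties`:

```python
# Standalone check of the key-fallback logic from fetch_results.py.
CANDIDATE_KEYS = (
    "average_action_completion___agent_leaderboard",
    "average_Action Completion - Agent Leaderboard",
    "average_agentic_session_success",
)

def lookup_action_completion(props: dict) -> float | None:
    # Try each candidate key in priority order; first match wins.
    for key in CANDIDATE_KEYS:
        if key in props:
            return round(props[key], 2)
    return None

# New-style key is picked up even when the legacy key is absent.
assert lookup_action_completion(
    {"average_action_completion___agent_leaderboard": 0.873}
) == 0.87
# Legacy key still works as a fallback; a missing metric yields None.
assert lookup_action_completion({"average_agentic_session_success": 0.5}) == 0.5
assert lookup_action_completion({}) is None
print("fallback lookup behaves as expected")
```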
Closes #12
Test plan
- Verify `fetch_results.py` correctly picks up the new metric key from aggregate properties

🤖 Generated with Claude Code
Generated description
Below is a concise technical summary of the changes proposed in this PR:
Update the evaluation config and experiment runners to request the custom Action Completion leaderboard metric via `METRICS`, ensuring simulated experiments run with the new name. Handle the metric key change in `fetch_results.py` by checking multiple aggregate property formats so the Action Completion score continues to be reported.

- Handle the metric key change in `fetch_results.py` by trying multiple aggregate property names before reporting the metric, and log when it is absent.

Modified files (1)

Latest Contributors (1)
- The experiment runners take their `--metrics` defaults from `METRICS`, which now references the custom Action Completion leaderboard metric so all runs use the updated name.

Modified files (4)

Latest Contributors (1)