2 changes: 1 addition & 1 deletion v2/evaluate/config.py
@@ -14,7 +14,7 @@
 METRICS = [
     "tool_selection_quality",
-    "agentic_session_success",
+    "Action Completion - Agent Leaderboard",
 ]

conorbronsdon marked this conversation as resolved.

 FILE_PATHS = {
3 changes: 2 additions & 1 deletion v2/evaluate/run_experiment.py
@@ -1,6 +1,7 @@
 import argparse
 from typing import List
 from simulation import run_simulation_experiments
+from config import METRICS
 from dotenv import load_dotenv

 load_dotenv("../.env")
@@ -56,7 +57,7 @@ def main():
     parser.add_argument(
         "--metrics",
         type=str,
-        default="tool_selection_quality,agentic_session_success",
+        default=",".join(METRICS),
         help="Comma-separated list of metrics to evaluate",
     )
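A quick aside on the pattern above: deriving the argparse default from config.METRICS keeps the CLI and the config in sync, so adding a metric to config.py automatically updates every entry point. A minimal sketch of the round-trip follows; the split-side handling is an assumption about how run_experiment.py consumes args.metrics, and it is safe here because the metric names contain no commas.

import argparse

METRICS = [  # stand-in for `from config import METRICS`
    "tool_selection_quality",
    "Action Completion - Agent Leaderboard",
]

parser = argparse.ArgumentParser()
parser.add_argument(
    "--metrics",
    type=str,
    default=",".join(METRICS),
    help="Comma-separated list of metrics to evaluate",
)
args = parser.parse_args([])  # empty argv: exercises the default

# Split the comma-separated string back into a list for the runner.
metrics = args.metrics.split(",")
assert metrics == METRICS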
3 changes: 2 additions & 1 deletion v2/evaluate/run_parallel_experiments.py
@@ -5,6 +5,7 @@
 import itertools
 from typing import List, Tuple
 from simulation import run_simulation_experiments
+from config import METRICS
 from dotenv import load_dotenv
 import time

@@ -108,7 +109,7 @@ def main():
     parser.add_argument(
         "--metrics",
         type=str,
-        default="tool_selection_quality,agentic_session_success",
+        default=",".join(METRICS),
         help="Comma-separated list of metrics to evaluate",
     )
2 changes: 1 addition & 1 deletion v2/evaluate/simple_test.py
@@ -321,7 +321,7 @@ def generate_response(
 experiment_name = f"weather-conversation-experiment-{int(time.time() * 1000000)}"
 METRICS = [
     "tool_selection_quality",
-    "agentic_session_success",
+    "Action Completion - Agent Leaderboard",
 ]

 # No galileo_context here - we're creating new loggers for each turn
16 changes: 15 additions & 1 deletion v2/results/fetch_results.py
@@ -214,10 +214,24 @@ def process_experiment(exp, model):
     else:
         print(f"average_tool_selection_quality not found in experiment {exp.name}, proceeding without it")

+    # Look up the action completion metric by key.
+    # The key changed from 'average_agentic_session_success' to a
+    # custom metric name after a Galileo platform update (see #12).
+    action_completion = None
+    props = exp.aggregate_metrics.additional_properties
+    for key in ("average_action_completion___agent_leaderboard",
+                "average_Action Completion - Agent Leaderboard",
+                "average_agentic_session_success"):
+        if key in props:
+            action_completion = round(props[key], 2)
+            break
+    if action_completion is None:
Comment on lines +217 to +228

Missing metric is reported only via print(f"Action completion metric not found in experiment {exp.name}, proceeding without it"), so the message bypasses our structured logging (LoggingConfig.get_logger()) and won't be captured or monitored. Can we replace this print (and any user-visible diagnostics in this block) with the configured logger, e.g. logger.warning(...)?

Finding type: Awesome Galileo Reviewer | Severity: 🟢 Low



Prompt for AI Agents:

In v2/results/fetch_results.py around lines 217 to 229, the block inside
process_experiment uses print(...) to log the missing action completion metric. Replace
print(f"Action completion metric not found in experiment {exp.name}, proceeding without
it") with a structured logger call (e.g., logger.warning(...)). If a module-level logger
variable is not available, obtain one via LoggingConfig.get_logger() (or the project’s
configured logger) at the top of the file and reuse it. Also ensure any other
user-visible diagnostics in this block use logger.warning or logger.info instead of
print so the absence of metrics is recorded in centralized logs.
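For illustration, the requested change might look like the sketch below. This is not the project's actual implementation: LoggingConfig.get_logger() is the reviewer's name for the project's logging helper and its exact API is assumed, so a stdlib logger stands in here.

import logging

# Stand-in for the project's LoggingConfig.get_logger() helper (assumed API).
logger = logging.getLogger("fetch_results")
logging.basicConfig(level=logging.INFO)

def report_missing_metric(experiment_name: str) -> None:
    # Unlike print(), this flows through whatever handlers and centralized
    # sinks the logging configuration wires up.
    logger.warning(
        "Action completion metric not found in experiment %s, proceeding without it",
        experiment_name,
    )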

Contributor (Author):


Skipping this one — the entire codebase uses print() for diagnostics, so switching just this spot to a logger would be inconsistent. Can be a separate cleanup.


Understood—keeping it consistent with the rest of the codebase makes sense for now, and deferring logger cleanup to a separate pass is reasonable.

print(f"Action completion metric not found in experiment {exp.name}, proceeding without it")

result = {
'experiment_name': exp.name,
'total_responses': exp.aggregate_metrics.additional_properties['total_responses'],
'average_action_completion': round(exp.aggregate_metrics.additional_properties['average_agentic_session_success'], 2),
'average_action_completion': action_completion,
'average_tool_selection_quality': tool_selection_quality,
'model': final_model_name,
'category': category
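On why the lookup loop tries three keys: the triple underscore in average_action_completion___agent_leaderboard suggests the platform derives the aggregate key by lowercasing the metric name and mapping each non-alphanumeric character to its own underscore, with an average_ prefix. This is a speculative reconstruction from the observed key, not documented Galileo behavior:

import re

def aggregate_key(metric_name: str) -> str:
    # Each non-alphanumeric character becomes its own underscore, which is
    # why "Action Completion - Agent Leaderboard" (space, dash, space)
    # yields three consecutive underscores.
    return "average_" + re.sub(r"[^a-z0-9]", "_", metric_name.lower())

assert aggregate_key("Action Completion - Agent Leaderboard") == \
    "average_action_completion___agent_leaderboard"

Trying the raw-name and legacy keys as fallbacks keeps the fetch working across platform versions, at the cost of a lookup that silently tolerates a renamed metric.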