
Conversation

@sashimono-san
Contributor

Purpose

The main goal of the notebook is to give some insight into bulk evaluation explanations, especially when there are too many evaluation samples for manual analysis.

Does this introduce a breaking change?

[ ] Yes
[x] No

Pull Request Type

What kind of change does this Pull Request introduce?

[ ] Bugfix
[ ] Feature
[ ] Code style update (formatting, local variables)
[ ] Refactoring (no functional changes, no api changes)
[ ] Documentation content changes
[x] Other... Please describe:

It's an exploratory notebook, showing options on how people can analyze bulk results.

What to Check

Primarily: notebook story.

I feel it's too much code for the notebook, especially towards the end, close to the visualization part. Our example results are not ideal either, as we did not find much valuable information. As discussed, this mainly highlights an issue with the model-as-judge approaches, which have little to no direction on what to evaluate with the demo example we have.

@frank-msft
Contributor

@sashimono-san The notebook is too large to be viewed in the browser, so I'm leaving comments here:

  • Consider removing the labels on the cluster plot; it's difficult to see the clusters. Something like the sketch below might be enough.
  • What's the conclusion of the analysis? Is the goal of this analysis to demonstrate gpt5-mini is better than gpt5-nano at certain tasks?
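
For the cluster view, a color-coded scatter with a legend might be enough. Here's a rough sketch with placeholder data, since I can't open the notebook to check the actual variable names:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Placeholder data standing in for the notebook's 2-D projection of the
# explanation embeddings and their cluster assignments.
coords, cluster_ids = make_blobs(n_samples=300, centers=4, random_state=0)

fig, ax = plt.subplots(figsize=(8, 6))
scatter = ax.scatter(coords[:, 0], coords[:, 1], c=cluster_ids, cmap="tab10", s=20)
# A color legend instead of per-point text labels keeps the clusters readable.
ax.legend(*scatter.legend_elements(), title="cluster")
ax.set_title("Evaluation explanation clusters")
plt.show()
```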

@sashimono-san
Contributor Author

Hi @frank-msft , thanks for taking a look. I'll remove the labels on the plot. I think we could benefit from having only one or two visualizations.

The general goal is to enable comparison and understanding of free-text evaluation explanations, especially when we have a large number of evaluation results. In our case we were trying to understand how different models performed as orchestrators, but the same idea can be applied to analyse/compare different agent prompts, models, or anything else that may affect how the system behaves.

And you are right, we are indeed comparing models based on their performance in certain tasks. We do this by extracting factual observations from the evaluation explanations and analysing the common topics for each model.
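
To make that concrete, the topic step is roughly shaped like the sketch below. This is not the notebook code; it assumes the factual observations have already been extracted, and the DataFrame columns are placeholders:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical input: one row per factual observation already extracted
# from an evaluation explanation, tagged with the model under test.
df = pd.DataFrame({
    "model": ["gpt5-mini", "gpt5-nano", "gpt5-mini", "gpt5-nano"],
    "observation": [
        "combined answers from both agents into one summary",
        "missed the follow-up question from the user",
        "resolved the discrepancy between the two agent answers",
        "repeated the same tool call twice",
    ],
})

# Vectorize the observations and cluster them into coarse "topics".
vectors = TfidfVectorizer(stop_words="english").fit_transform(df["observation"])
df["topic"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

# Common topics per model: how often each model's observations fall in each topic.
print(df.groupby(["model", "topic"]).size().unstack(fill_value=0))
```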

As for conclusions, based on the existing metrics, there's no big difference between gpt5-mini and gpt5-nano. Subjectively, I would say gpt5-mini has an advantage, as the evaluation results highlight that it did a good job integrating information from multiple agents and dealing with discrepancies. (We can see this in the "positive and negative" topic tables for each model.)
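
For reference, by "topic tables" I just mean per-model topic counts split into positive and negative observations, along these lines (the rows are made up, only the shape matters):

```python
import pandas as pd

# Hypothetical observations labelled with a topic and a positive/negative
# sentiment during the topic step; the rows are illustrative only.
obs = pd.DataFrame({
    "model": ["gpt5-mini", "gpt5-mini", "gpt5-nano", "gpt5-nano"],
    "sentiment": ["positive", "negative", "positive", "negative"],
    "topic": [
        "integrates information from multiple agents",
        "verbose final summaries",
        "concise answers",
        "drops conflicting details",
    ],
})

# One positive and one negative topic table per model.
for (model, sentiment), group in obs.groupby(["model", "sentiment"]):
    print(f"\n{model} ({sentiment})")
    print(group["topic"].value_counts().to_frame("count"))
```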

But it also highlights how inconsistent these model-as-judge metrics can be. We do not define strong criteria to guide the evaluations, so the explanations vary a lot, and as a result we get very different topics in the evaluation results of each model. We mitigated this to some extent by grouping topics from all evaluation results and comparing the models side by side. This "dilutes" the individual strengths and weaknesses, but makes for a fairer comparison. (This is where the visualizations in the last cell come in.)
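
Concretely, the last cell boils down to a shared-topic frequency view roughly like this (again a sketch with made-up rows, not the actual notebook output):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical topic assignments pooled across the evaluation results of
# both models, so both are scored against the same shared topic set.
pooled = pd.DataFrame({
    "model": ["gpt5-mini", "gpt5-nano", "gpt5-mini", "gpt5-nano", "gpt5-mini"],
    "topic": [
        "integrates agent answers",
        "integrates agent answers",
        "handles discrepancies",
        "misses follow-ups",
        "handles discrepancies",
    ],
})

# Topic frequency per model on the shared topics, plotted side by side.
pd.crosstab(pooled["topic"], pooled["model"]).plot(kind="barh", figsize=(8, 4))
plt.xlabel("number of evaluation explanations")
plt.tight_layout()
plt.show()
```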
