
Conversation

@sashimono-san
Contributor

Purpose

The main goal of the notebook is to give some insight into bulk evaluation explanations, especially when there are too many evaluation samples for manual analysis.

Does this introduce a breaking change?

[ ] Yes
[x] No

Pull Request Type

What kind of change does this Pull Request introduce?

[ ] Bugfix
[ ] Feature
[ ] Code style update (formatting, local variables)
[ ] Refactoring (no functional changes, no api changes)
[ ] Documentation content changes
[x] Other... Please describe:

It's an exploratory notebook, showing options on how people can analyze bulk results.

What to Check

Primarily: notebook story.

I feel it's too much code for the notebook, especially towards the end, close to the visualization part. Our example results are not ideal either, as we did not find much valuable information. As discussed, this mainly highlights an issue with the model-as-judge approaches, which have little to no direction on what to evaluate with the demo example we have.

@frank-msft
Contributor

@sashimono-san The notebook is too large to be viewed in the browser, so I'm leaving comments here:

  • Consider removing the labels on the cluster plot; it's difficult to see the clusters. Something like the sketch below might be enough.
  • What's the conclusion of the analysis? Is the goal of this analysis to demonstrate gpt5-mini is better than gpt5-nano at certain tasks?
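
For the cluster view, a color-coded scatter with a legend might be enough. Here's a rough sketch with placeholder data, since I can't open the notebook to check the actual variable names:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Placeholder data standing in for the notebook's 2-D projection of the
# explanation embeddings and their cluster assignments.
coords, cluster_ids = make_blobs(n_samples=300, centers=4, random_state=0)

fig, ax = plt.subplots(figsize=(8, 6))
scatter = ax.scatter(coords[:, 0], coords[:, 1], c=cluster_ids, cmap="tab10", s=20)
# A color legend instead of per-point text labels keeps the clusters readable.
ax.legend(*scatter.legend_elements(), title="cluster")
ax.set_title("Evaluation explanation clusters")
plt.show()
```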

@sashimono-san
Contributor Author

Hi @frank-msft , thanks for taking a look. I'll remove the labels on the plot. I think we could benefit from having only one or two visualizations.

The general goal is to enable comparison and understanding of free-text evaluation explanations, especially when we have a large number of evaluation results. In our case we were trying to understand how different models performed as orchestrators, but the same idea can be applied to analyse/compare different agent prompts, models, or anything else that may affect how the system behaves.

And you are right, we are indeed comparing models based on their performance in certain tasks. We do this by extracting factual observations from the evaluation explanations and analysing the common topics for each model.
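
To make that concrete, the topic step is roughly shaped like the sketch below. This is not the notebook code; it assumes the factual observations have already been extracted, and the DataFrame columns are placeholders:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical input: one row per factual observation already extracted
# from an evaluation explanation, tagged with the model under test.
df = pd.DataFrame({
    "model": ["gpt5-mini", "gpt5-nano", "gpt5-mini", "gpt5-nano"],
    "observation": [
        "combined answers from both agents into one summary",
        "missed the follow-up question from the user",
        "resolved the discrepancy between the two agent answers",
        "repeated the same tool call twice",
    ],
})

# Vectorize the observations and cluster them into coarse "topics".
vectors = TfidfVectorizer(stop_words="english").fit_transform(df["observation"])
df["topic"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

# Common topics per model: how often each model's observations fall in each topic.
print(df.groupby(["model", "topic"]).size().unstack(fill_value=0))
```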

As for conclusions, based on the existing metrics, there's no big difference between gpt5-mini and gpt5-nano. Subjectively, I would say gpt5-mini has an advantage, as the evaluation results highlight that it did a good job integrating information from multiple agents and dealing with discrepancies. (We can see this in the "positive and negative" topic tables for each model.)
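
For reference, by "topic tables" I just mean per-model topic counts split into positive and negative observations, along these lines (the rows are made up, only the shape matters):

```python
import pandas as pd

# Hypothetical observations labelled with a topic and a positive/negative
# sentiment during the topic step; the rows are illustrative only.
obs = pd.DataFrame({
    "model": ["gpt5-mini", "gpt5-mini", "gpt5-nano", "gpt5-nano"],
    "sentiment": ["positive", "negative", "positive", "negative"],
    "topic": [
        "integrates information from multiple agents",
        "verbose final summaries",
        "concise answers",
        "drops conflicting details",
    ],
})

# One positive and one negative topic table per model.
for (model, sentiment), group in obs.groupby(["model", "sentiment"]):
    print(f"\n{model} ({sentiment})")
    print(group["topic"].value_counts().to_frame("count"))
```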

But it also highlights how inconsistent these model-as-judge metrics can be. We do not define strong criteria to guide the evaluations, so the explanations vary a lot, and as a result we get very different topics in the evaluation results of each model. We mitigated this to some extent by grouping topics from all evaluation results and comparing the models side by side. This "dilutes" the individual strengths and weaknesses, but makes for a fairer comparison. (This is where the visualizations in the last cell come in.)
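
Concretely, the last cell boils down to a shared-topic frequency view roughly like this (again a sketch with made-up rows, not the actual notebook output):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical topic assignments pooled across the evaluation results of
# both models, so both are scored against the same shared topic set.
pooled = pd.DataFrame({
    "model": ["gpt5-mini", "gpt5-nano", "gpt5-mini", "gpt5-nano", "gpt5-mini"],
    "topic": [
        "integrates agent answers",
        "integrates agent answers",
        "handles discrepancies",
        "misses follow-ups",
        "handles discrepancies",
    ],
})

# Topic frequency per model on the shared topics, plotted side by side.
pd.crosstab(pooled["topic"], pooled["model"]).plot(kind="barh", figsize=(8, 4))
plt.xlabel("number of evaluation explanations")
plt.tight_layout()
plt.show()
```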
