[Evals] --verbose flag and eval metric statistics #1631

Open
ssbushi opened this issue Jan 21, 2025 · 1 comment
ssbushi commented Jan 21, 2025

Is your feature request related to a problem? Please describe.
Evals CLI commands should have an option to dump output to STDOUT and show general aggregate metrics.

Additional context
Filed on behalf of @jacobsimionato

I think a --verbose flag would be great. I also think you should print summary statistics (e.g. mean or p50 scores for each evaluator type) for every run, because that adds minimal log spam and gives a clear confirmation that the evaluator actually ran and produced meaningful results. Also, print a URL in the console that opens the results for that run in the Genkit console, if it is running.

Having spent some time working with the Gen UI codebase, which has 50-100 eval data points (it doesn't use Genkit), a common workflow looks like this:

  1. Make a change to the code.
  2. Run the eval pipeline.
  3. Check whether the summary metrics are better or worse.
  4. (Optional) Investigate traces to understand what changed.

Very often, you are just trying all kinds of tweaks and doing only steps 1-3, so the summary metrics are enough.


ssbushi commented Jan 21, 2025

My thoughts:

  • We should add the URL to view the results. This is now possible since the runtime is already up and running.
  • Summary metrics are useful, but this needs more thought. Not all metrics can be summarized by calculating a mean; e.g., enum scores, boolean scores, and string scores are all valid in Genkit. We can start with support for numeric scores and improve as we go.
  • --verbose flag that outputs the results to STDOUT is a nice-to-have.
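To make the "numeric scores first" idea concrete, here is a minimal sketch of how the CLI could aggregate scores per evaluator. This is not Genkit's actual API; the `Score` shape and `summarizeScores` helper are hypothetical, and the only assumption taken from the thread is that score values may be numbers, booleans, enums, or strings, with only numeric ones summarized for now.

```typescript
// Hypothetical score record; Genkit score values may be numeric,
// boolean, enum, or string. Only numeric values are aggregated here.
type Score = { evaluator: string; value: number | boolean | string };

interface Summary {
  evaluator: string;
  count: number;
  mean: number;
  p50: number;
}

function summarizeScores(scores: Score[]): Summary[] {
  // Group numeric values by evaluator name, skipping non-numeric scores.
  const byEvaluator = new Map<string, number[]>();
  for (const s of scores) {
    if (typeof s.value !== "number") continue;
    const values = byEvaluator.get(s.evaluator) ?? [];
    values.push(s.value);
    byEvaluator.set(s.evaluator, values);
  }

  const summaries: Summary[] = [];
  for (const [evaluator, values] of byEvaluator) {
    values.sort((a, b) => a - b);
    const mean = values.reduce((a, b) => a + b, 0) / values.length;
    // p50: middle element, or average of the two middle elements.
    const mid = Math.floor(values.length / 2);
    const p50 =
      values.length % 2 === 1
        ? values[mid]
        : (values[mid - 1] + values[mid]) / 2;
    summaries.push({ evaluator, count: values.length, mean, p50 });
  }
  return summaries;
}
```

Non-numeric evaluators would simply be absent from the summary table in this scheme, which keeps the first iteration small while leaving room to add mode counts for enum/boolean scores later.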

@ssbushi ssbushi self-assigned this Jan 21, 2025