[Evals] --verbose flag and eval metric statistics #1631

Open
ssbushi opened this issue Jan 21, 2025 · 1 comment
ssbushi commented Jan 21, 2025

Is your feature request related to a problem? Please describe.
Evals CLI commands should have an option to dump output to STDOUT and show general aggregate metrics.

Additional context
Filed on behalf of @jacobsimionato

I think a --verbose flag would be great. I also think you should print summary statistics (e.g. mean or p50 scores for each evaluator type) for every run, because that adds minimal log spam and gives a clear confirmation that the evaluator actually ran and produced meaningful results. Also, print a URL in the console that opens the results for that run in the Genkit console, if it is running.

Having spent some time working with the Gen UI codebase, which has 50-100 eval data points (it doesn't use Genkit), a common workflow looks like this:

  1. Make a change to the code.
  2. Run the eval pipeline.
  3. Check whether the summary metrics are better or worse.
  4. (Optional) Investigate traces to understand what changed.

Very often, you are just trying all kinds of tweaks and doing only steps 1-3, so the summary metrics are enough.


ssbushi commented Jan 21, 2025

My thoughts:

  • We should add the URL to view the results. This is now possible since the runtime is already up and running.
  • Summary metrics are useful, but this needs more thought. Not all metrics can be summarized by calculating a mean; e.g., enum scores, boolean scores, and string scores are all valid in Genkit. We can start with support for numeric scores and improve as we go.
  • --verbose flag that outputs the results to STDOUT is a nice-to-have.
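To make the "numeric scores first" idea concrete, here is a minimal sketch of how the CLI could aggregate scores per evaluator. This is not Genkit's actual API; the `Score` shape and `summarizeScores` helper are hypothetical, and the only assumption taken from the thread is that score values may be numbers, booleans, enums, or strings, with only numeric ones summarized for now.

```typescript
// Hypothetical score record; Genkit score values may be numeric,
// boolean, enum, or string. Only numeric values are aggregated here.
type Score = { evaluator: string; value: number | boolean | string };

interface Summary {
  evaluator: string;
  count: number;
  mean: number;
  p50: number;
}

function summarizeScores(scores: Score[]): Summary[] {
  // Group numeric values by evaluator name, skipping non-numeric scores.
  const byEvaluator = new Map<string, number[]>();
  for (const s of scores) {
    if (typeof s.value !== "number") continue;
    const values = byEvaluator.get(s.evaluator) ?? [];
    values.push(s.value);
    byEvaluator.set(s.evaluator, values);
  }

  const summaries: Summary[] = [];
  for (const [evaluator, values] of byEvaluator) {
    values.sort((a, b) => a - b);
    const mean = values.reduce((a, b) => a + b, 0) / values.length;
    // p50: middle element, or average of the two middle elements.
    const mid = Math.floor(values.length / 2);
    const p50 =
      values.length % 2 === 1
        ? values[mid]
        : (values[mid - 1] + values[mid]) / 2;
    summaries.push({ evaluator, count: values.length, mean, p50 });
  }
  return summaries;
}
```

Non-numeric evaluators would simply be absent from the summary table in this scheme, which keeps the first iteration small while leaving room to add mode counts for enum/boolean scores later.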

@ssbushi ssbushi self-assigned this Jan 21, 2025