feat(actions): add swebench eval harness to github actions #8028
base: main
Size Change: -2 B (0%). Total Size: 13.6 MB
Successfully ran eval harness here: https://github.com/google-gemini/gemini-cli/actions/runs/17559259950
…gemini-cli into agmsb/eval-runner
Code Coverage Summary
CLI Package - Full Text Report
Core Package - Full Text Report
For detailed HTML reports, please see the 'coverage-reports-22.x-ubuntu-latest' artifact from the main CI run.
Let us wait for Matt to take a look as well.
Sorry for some of the basic questions but:
See the comment below. I think we need to make it clearer how this is used, what it does, where the code comes from, and how we see the results.
Not a problem; addressing them one by one below.
The eval harness is written in Python and uses Poetry for dependency/package management.
The harness will spawn 10 concurrent asyncio Tasks, but the harness itself will still run in a single container. Each Task will run Gemini CLI with npx, setting up the working directory and passing in the prompt. This will all run on the GitHub Actions runner.
Yes, I think we'll want to follow up on this quickly after this PR.
Currently, the action in this PR writes the resolved/unresolved count to stdout. In a follow-up PR, we will log run-level summaries to stdout so they are viewable in the GitHub Actions UI. For per-instance eval logs and agent trajectories, a follow-up PR will upload them to Google Cloud Storage and write a link to the GCS bucket to stdout.
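For readers unfamiliar with the concurrency model described above (10 asyncio Tasks inside a single container, each running Gemini CLI via npx), here is a minimal sketch. It is not the harness code from this PR: `EvalInstance`, `run_gemini_cli`, `run_all`, and the exact npx command line are assumptions made for illustration.

```python
import asyncio
from dataclasses import dataclass

MAX_CONCURRENCY = 10  # number of concurrent Tasks mentioned in this thread


@dataclass
class EvalInstance:
    """Hypothetical container for one eval instance."""
    instance_id: str
    workdir: str  # checkout the agent should work in
    prompt: str   # task description passed to the CLI


async def run_gemini_cli(instance: EvalInstance) -> str:
    """Run the CLI once in the instance's working directory.

    The command line below is an assumption, not taken from the PR.
    """
    proc = await asyncio.create_subprocess_exec(
        "npx", "@google/gemini-cli", "--prompt", instance.prompt,
        cwd=instance.workdir,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    stdout, _ = await proc.communicate()
    return stdout.decode()


async def run_all(instances, runner=run_gemini_cli, limit=MAX_CONCURRENCY):
    """Run every instance, keeping at most `limit` runs in flight at once."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(instance):
        async with semaphore:
            return instance.instance_id, await runner(instance)

    tasks = [asyncio.create_task(bounded(i)) for i in instances]
    # gather preserves input order; map instance id -> CLI output
    return dict(await asyncio.gather(*tasks))
```

Everything runs in one process on one event loop, which matches the "single container" point: the Tasks overlap only while waiting on their subprocesses. Passing a different `runner` makes it possible to exercise the scheduling logic without invoking npx.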
TLDR
We want to be able to run evaluations with GitHub Actions to catch regressions introduced by changes. This PR adds an action that runs a Docker container with a SWE-bench evaluation harness baked into the container image. The purpose is to enable ad hoc evaluations natively in the Gemini CLI OSS repository.
Dive Deeper
In this PR, running the GitHub action will run gemini-cli on a sample of 20 instances from SWE-bench Lite and, once the work completes, will upload the gemini-cli outputs to be scored for pass/fail.
Currently, this action runs gemini-cli from HEAD of the main branch; immediately after this PR, we will need to update the harness so that it can run against the branch it was dispatched from in GitHub Actions.
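As a rough illustration of the sampling and scoring described above, the sketch below draws a fixed-size sample of instance IDs and tallies a resolved/unresolved count. The function names, the seeded sampling, and the shape of the results dictionary are all assumptions; the PR does not say how the 20 instances are chosen or how scores are represented.

```python
import random

SAMPLE_SIZE = 20  # sample size stated in this PR description


def sample_instances(instance_ids, k=SAMPLE_SIZE, seed=0):
    """Pick k distinct instance IDs; a fixed seed keeps runs comparable (assumption)."""
    rng = random.Random(seed)
    # Sort first so the sample does not depend on the input order
    return rng.sample(sorted(instance_ids), k)


def summarize(results):
    """Tally pass/fail results given a mapping of instance id -> bool (hypothetical shape)."""
    resolved = sum(1 for passed in results.values() if passed)
    return {"resolved": resolved, "unresolved": len(results) - resolved}


if __name__ == "__main__":
    # Placeholder IDs for illustration only; not real benchmark instances
    ids = [f"example__repo-{n}" for n in range(300)]
    chosen = sample_instances(ids)
    print(len(chosen), "instances selected")
```

Using the same sample on every run is one way to make scores comparable across changes, at the cost of only ever exercising those instances.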
The workflow is as follows:
Reviewer Test Plan
Select the Eval workflow, click Run workflow, and choose the agmsb/eval-runner branch.
Testing Matrix
Linked issues / bugs
This makes progress on #4578.