Skip to content

Conversation

agmsb
Copy link
Contributor

@agmsb agmsb commented Sep 8, 2025

TLDR

We want to be able to run evaluations to catch regressions on changes using GitHub Actions. This PR adds an action that runs a Docker container with a swe-bench evaluation harness baked into the container image. The purpose of this is to enable adhoc evaluations natively in the Gemini CLI OSS repository.

Dive Deeper

In this PR, running the GitHub action will run a sample of 20 instances from Swe-bench Lite with gemini-cli, and upon completion of work, will upload the outputs from Gemini-CLI to be scored for pass/fail.

Currently, this action will run gemini-cli from HEAD on main branch; we will need to update the harness to allow for running on the branch it was dispatched from in GitHub Actions immediately after this PR.

The workflow is as follows:

  1. Script runs to download swebench data.
  2. The eval harness will spin up 10 concurrent workers, and each worker will read the data for one of the 20 samples from Swe-bench Lite.
  3. Each worker will clone the working directory into a temp directory, assemble a prompt for Gemini CLI based on the instance metadata, and run the Gemini-CLI via npx in that temp directory with the provided prompt. During this execution, telemetry is collected locally.
  4. Upon completion of work, the harness will generate a diff patch from Gemini CLI's work, and will upload these along with instance metadata to GCS.
  5. Outside of this workflow, we run a continuous scorer polling GCS for these diffs, running Swe-bench Lite's scoring mechanism, and uploading results back to GCS.
  6. The harness will read the results and print out the number of passes and/or failures.

Reviewer Test Plan

  1. Go to GitHub Actions.
  2. Navigate to Eval.
  3. Click on Run workflow.
  4. Select agmsb/eval-runner branch.
  5. Let the workflow execute.

Testing Matrix

🍏 🪟 🐧
npm run
npx
Docker
Podman - -
Seatbelt - -

Linked issues / bugs

This makes progress on #4578.

Copy link

github-actions bot commented Sep 8, 2025

Size Change: -2 B (0%)

Total Size: 13.6 MB

ℹ️ View Unchanged
Filename Size Change
./bundle/gemini.js 13.6 MB -2 B (0%)
./bundle/sandbox-macos-permissive-closed.sb 1.03 kB 0 B
./bundle/sandbox-macos-permissive-open.sb 830 B 0 B
./bundle/sandbox-macos-permissive-proxied.sb 1.31 kB 0 B
./bundle/sandbox-macos-restrictive-closed.sb 3.29 kB 0 B
./bundle/sandbox-macos-restrictive-open.sb 3.36 kB 0 B
./bundle/sandbox-macos-restrictive-proxied.sb 3.56 kB 0 B

compressed-size-action

@gemini-cli gemini-cli bot added area/quality Focuses on testing, reliability, performance, and overall product quality. area/models Support for Gemini models, multi-modality, local execution, and performance tuning. maintainer For Roadmap Items labels Sep 8, 2025
@agmsb
Copy link
Contributor Author

agmsb commented Sep 8, 2025

Successfully ran eval harness here: https://github.com/google-gemini/gemini-cli/actions/runs/17559259950

@agmsb agmsb marked this pull request as ready for review September 8, 2025 21:20
@agmsb agmsb requested a review from a team as a code owner September 8, 2025 21:20
Copy link

github-actions bot commented Sep 8, 2025

Code Coverage Summary

Package Lines Statements Functions Branches
CLI 75.64% 75.64% 77.07% 80.05%
Core 78.89% 78.89% 78.21% 84.56%
CLI Package - Full Text Report
-------------------|---------|----------|---------|---------|-------------------
File               | % Stmts | % Branch | % Funcs | % Lines | Uncovered Line #s 
-------------------|---------|----------|---------|---------|-------------------
All files          |   75.64 |    80.05 |   77.07 |   75.64 |                   
 src               |   61.88 |    63.38 |   53.33 |   61.88 |                   
  gemini.tsx       |   49.49 |       40 |      50 |   49.49 | ...70,481-489,503 
  ...ractiveCli.ts |   90.38 |    73.68 |   33.33 |   90.38 | 36-39,58-61,88-90 
  ...ActiveAuth.ts |     100 |      100 |     100 |     100 |                   
 src/commands      |   70.45 |      100 |      25 |   70.45 |                   
  extensions.tsx   |   55.55 |      100 |       0 |   55.55 | 21-31,35          
  mcp.ts           |   94.11 |      100 |      50 |   94.11 | 26                
 ...nds/extensions |   48.42 |    76.19 |   32.14 |   48.42 |                   
  disable.ts       |    28.2 |      100 |       0 |    28.2 | 17-27,33-44,46-50 
  enable.ts        |    23.4 |      100 |       0 |    23.4 | 17-35,41-52,54-58 
  install.ts       |   67.12 |     37.5 |   66.66 |   67.12 | ...62-64,86,89-93 
  link.ts          |   28.57 |      100 |       0 |   28.57 | 19-33,40-45,47-50 
  list.ts          |   32.14 |      100 |       0 |   32.14 | 11-27,34-35       
  new.ts           |     100 |      100 |     100 |     100 |                   
  uninstall.ts     |   45.71 |      100 |   33.33 |   45.71 | 15-23,35-40,43-46 
  update.ts        |   19.64 |      100 |       0 |   19.64 | 21-46,53-68,70-74 
 ...les/mcp-server |       0 |        0 |       0 |       0 |                   
  example.ts       |       0 |        0 |       0 |       0 | 1-60              
 src/commands/mcp  |   95.62 |       80 |    90.9 |   95.62 |                   
  add.ts           |    97.4 |    83.33 |     100 |    97.4 | 109-112,119       
  list.ts          |   90.56 |    80.76 |      80 |   90.56 | ...07-109,134-135 
  remove.ts        |     100 |    66.66 |     100 |     100 | 19-23             
 src/config        |      92 |    85.02 |    87.5 |      92 |                   
  auth.ts          |     100 |      100 |     100 |     100 |                   
  config.ts        |   97.21 |    90.32 |      80 |   97.21 | ...08,550,669-673 
  extension.ts     |    79.1 |    82.88 |   81.48 |    79.1 | ...47-551,606-615 
  keyBindings.ts   |     100 |      100 |     100 |     100 |                   
  sandboxConfig.ts |   54.16 |    23.07 |   66.66 |   54.16 | ...44,54-68,73-89 
  settings.ts      |   88.88 |    82.92 |   95.23 |   88.88 | ...13,615,617-618 
  ...ingsSchema.ts |     100 |      100 |     100 |     100 |                   
  ...tedFolders.ts |   92.96 |    91.17 |     100 |   92.96 | ...56-157,170-175 
 ...fig/extensions |   92.85 |    82.35 |     100 |   92.85 |                   
  ...ableSchema.ts |     100 |      100 |     100 |     100 |                   
  variables.ts     |   90.69 |    82.35 |     100 |   90.69 | 30-31,64-65       
 src/core          |   81.63 |    42.85 |     100 |   81.63 |                   
  auth.ts          |   56.25 |       50 |     100 |   56.25 | 27-36             
  initializer.ts   |     100 |       50 |     100 |     100 | 37                
  theme.ts         |      80 |    33.33 |     100 |      80 | 18-19             
 src/generated     |     100 |      100 |     100 |     100 |                   
  git-commit.ts    |     100 |      100 |     100 |     100 |                   
 src/patches       |       0 |        0 |       0 |       0 |                   
  is-in-ci.ts      |       0 |        0 |       0 |       0 | 1-17              
 src/services      |   68.61 |    90.66 |   92.85 |   68.61 |                   
  ...mandLoader.ts |     100 |      100 |     100 |     100 |                   
  ...andService.ts |     100 |      100 |     100 |     100 |                   
  ...mandLoader.ts |   89.44 |    91.11 |     100 |   89.44 | ...85-190,273-280 
  ...omptLoader.ts |   29.68 |    81.25 |   66.66 |   29.68 | ...40-241,244-248 
  types.ts         |     100 |      100 |     100 |     100 |                   
 ...mpt-processors |   97.03 |     93.5 |     100 |   97.03 |                   
  ...tProcessor.ts |     100 |      100 |     100 |     100 |                   
  ...eProcessor.ts |   94.44 |    84.21 |     100 |   94.44 | 43-44,90-91       
  ...tionParser.ts |     100 |      100 |     100 |     100 |                   
  ...lProcessor.ts |   96.96 |    94.87 |     100 |   96.96 | 93-96             
  types.ts         |     100 |      100 |     100 |     100 |                   
 src/test-utils    |   91.22 |    83.33 |      80 |   91.22 |                   
  ...omMatchers.ts |   69.69 |       50 |      50 |   69.69 | 32-35,37-39,45-47 
  ...andContext.ts |     100 |      100 |     100 |     100 |                   
  render.tsx       |     100 |      100 |     100 |     100 |                   
 src/ui            |   71.03 |    69.74 |   61.53 |   71.03 |                   
  App.tsx          |     100 |      100 |     100 |     100 |                   
  AppContainer.tsx |   71.19 |     50.7 |   30.76 |   71.19 | ...78-831,863-866 
  ...tionNudge.tsx |    8.33 |      100 |       0 |    8.33 | 25-97             
  colors.ts        |   79.59 |      100 |   66.66 |   79.59 | ...43,45-46,48-49 
  constants.ts     |     100 |      100 |     100 |     100 |                   
  keyMatchers.ts   |   95.65 |    95.65 |     100 |   95.65 | 25-26             
  ...tic-colors.ts |     100 |      100 |     100 |     100 |                   
  textConstants.ts |     100 |      100 |     100 |     100 |                   
  types.ts         |     100 |      100 |     100 |     100 |                   
 src/ui/auth       |   59.92 |      100 |      50 |   59.92 |                   
  AuthDialog.tsx   |     100 |      100 |     100 |     100 |                   
  ...nProgress.tsx |   16.66 |      100 |       0 |   16.66 | 19-63             
  useAuth.ts       |      10 |      100 |       0 |      10 | 15-26,29-96       
 src/ui/commands   |   85.36 |    80.85 |   88.33 |   85.36 |                   
  aboutCommand.ts  |   95.65 |    58.33 |     100 |   95.65 | 52-53             
  authCommand.ts   |     100 |      100 |     100 |     100 |                   
  bugCommand.ts    |   79.48 |       40 |     100 |   79.48 | 33-36,78-87,93-94 
  chatCommand.ts   |   94.23 |    83.33 |     100 |   94.23 | ...11-212,214-215 
  clearCommand.ts  |     100 |      100 |     100 |     100 |                   
  ...essCommand.ts |     100 |    88.88 |     100 |     100 | 73                
  copyCommand.ts   |     100 |      100 |     100 |     100 |                   
  corgiCommand.ts  |     100 |      100 |     100 |     100 |                   
  ...ryCommand.tsx |   69.27 |    73.07 |     100 |   69.27 | ...25-126,161-169 
  docsCommand.ts   |     100 |      100 |     100 |     100 |                   
  editorCommand.ts |     100 |      100 |     100 |     100 |                   
  ...onsCommand.ts |     100 |      100 |     100 |     100 |                   
  helpCommand.ts   |     100 |      100 |     100 |     100 |                   
  ideCommand.ts    |   62.25 |       60 |   54.54 |   62.25 | ...52-266,274-288 
  initCommand.ts   |     100 |      100 |     100 |     100 |                   
  mcpCommand.ts    |   81.97 |    82.95 |   83.33 |   81.97 | ...85-386,436-443 
  memoryCommand.ts |   99.11 |    81.25 |     100 |   99.11 | 90                
  ...acyCommand.ts |     100 |      100 |     100 |     100 |                   
  quitCommand.ts   |     100 |      100 |     100 |     100 |                   
  ...oreCommand.ts |      92 |    87.09 |     100 |      92 | ...,82-87,128-129 
  ...ngsCommand.ts |     100 |      100 |     100 |     100 |                   
  ...hubCommand.ts |   83.66 |    66.66 |     100 |   83.66 | ...54-157,160-163 
  statsCommand.ts  |   84.48 |       75 |     100 |   84.48 | 25-33             
  ...tupCommand.ts |     100 |      100 |     100 |     100 |                   
  themeCommand.ts  |     100 |      100 |     100 |     100 |                   
  toolsCommand.ts  |     100 |      100 |     100 |     100 |                   
  types.ts         |     100 |      100 |     100 |     100 |                   
  vimCommand.ts    |   44.44 |      100 |       0 |   44.44 | 15-25             
 src/ui/components |   69.82 |    75.26 |   64.51 |   69.82 |                   
  AboutBox.tsx     |     100 |       50 |     100 |     100 | 104               
  AppHeader.tsx    |   36.36 |      100 |       0 |   36.36 | 19-35             
  AsciiArt.ts      |     100 |      100 |     100 |     100 |                   
  ...Indicator.tsx |   15.15 |      100 |       0 |   15.15 | 17-47             
  Composer.tsx     |   99.33 |    76.47 |     100 |   99.33 | 112               
  ...ryDisplay.tsx |   21.05 |      100 |       0 |   21.05 | 17-35             
  ...ryDisplay.tsx |   87.64 |    62.06 |     100 |   87.64 | ...48-49,79-84,89 
  ...geDisplay.tsx |     100 |      100 |     100 |     100 |                   
  ...gProfiler.tsx |      24 |      100 |       0 |      24 | 13-36             
  ...esDisplay.tsx |   10.52 |      100 |       0 |   10.52 | 24-82             
  ...ogManager.tsx |   13.54 |      100 |       0 |   13.54 | 29-183            
  ...ngsDialog.tsx |    6.99 |      100 |       0 |    6.99 | 30-181            
  ...ustDialog.tsx |     100 |      100 |     100 |     100 |                   
  Footer.tsx       |   81.89 |    79.41 |     100 |   81.89 | ...40-147,150-153 
  ...ngSpinner.tsx |   38.09 |      100 |       0 |   38.09 | 30-46             
  Header.tsx       |   87.23 |    57.14 |     100 |   87.23 | 36-39,55,64       
  Help.tsx         |   98.46 |       60 |     100 |   98.46 | 77,131            
  ...emDisplay.tsx |   71.01 |    43.75 |     100 |   71.01 | ...56-61,82-88,91 
  InputPrompt.tsx  |   78.43 |     77.9 |     100 |   78.43 | ...35-837,845-856 
  ...Indicator.tsx |     100 |      100 |     100 |     100 |                   
  MainContent.tsx  |   16.66 |      100 |       0 |   16.66 | 16-64             
  ...geDisplay.tsx |   25.92 |      100 |       0 |   25.92 | 15-37             
  ...tsDisplay.tsx |     100 |      100 |     100 |     100 |                   
  ...fications.tsx |   17.02 |      100 |       0 |   17.02 | 15-62             
  PrepareLabel.tsx |      50 |       20 |     100 |      50 | 28-30,35-48       
  ...otaDialog.tsx |     100 |      100 |     100 |     100 |                   
  ...ngDisplay.tsx |   23.07 |      100 |       0 |   23.07 | 13-37             
  ...ryDisplay.tsx |     100 |      100 |     100 |     100 |                   
  ...ngsDialog.tsx |   59.78 |    71.81 |      75 |   59.78 | ...26-727,761,772 
  ...ionDialog.tsx |    85.5 |      100 |   33.33 |    85.5 | 35-38,43-50       
  ...Indicator.tsx |   44.44 |      100 |       0 |   44.44 | 12-17             
  ...MoreLines.tsx |      28 |      100 |       0 |      28 | 18-40             
  StatsDisplay.tsx |    98.5 |    93.33 |     100 |    98.5 | 180-182           
  ...nsDisplay.tsx |   88.31 |     62.5 |     100 |   88.31 | 38-43,85,110-112  
  ThemeDialog.tsx  |      91 |    48.27 |      75 |      91 | ...13-114,156-158 
  Tips.tsx         |   19.23 |      100 |       0 |   19.23 | 17-45             
  ...tsDisplay.tsx |     100 |     87.5 |     100 |     100 | 30-31             
  ...ification.tsx |   36.36 |      100 |       0 |   36.36 | 15-22             
  ...ionDialog.tsx |    8.75 |      100 |       0 |    8.75 | 20-108            
 ...nents/messages |   80.82 |    86.39 |   57.14 |   80.82 |                   
  ...onMessage.tsx |   20.68 |      100 |       0 |   20.68 | 23-51             
  DiffRenderer.tsx |   94.13 |    86.59 |     100 |   94.13 | ...03,229-230,296 
  ErrorMessage.tsx |   22.22 |      100 |       0 |   22.22 | 16-31             
  ...niMessage.tsx |   18.75 |      100 |       0 |   18.75 | 21-49             
  ...geContent.tsx |   19.04 |      100 |       0 |   19.04 | 25-43             
  InfoMessage.tsx  |   26.31 |      100 |       0 |   26.31 | 17-32             
  ...onMessage.tsx |   83.04 |       80 |      40 |   83.04 | ...12-127,141-144 
  ...upMessage.tsx |   96.84 |    95.45 |     100 |   96.84 | 124-126           
  ToolMessage.tsx  |   88.35 |       80 |     100 |   88.35 | ...,93-97,175-177 
  UserMessage.tsx  |     100 |      100 |     100 |     100 |                   
  ...llMessage.tsx |   36.36 |      100 |       0 |   36.36 | 17-25             
 ...ponents/shared |   83.71 |    76.08 |   95.45 |   83.71 |                   
  MaxSizedBox.tsx  |   81.62 |    81.81 |   88.88 |   81.62 | ...07-508,613-614 
  ...tonSelect.tsx |   86.45 |    66.66 |     100 |   86.45 | ...52,155-156,230 
  ...eSelector.tsx |     100 |       50 |     100 |     100 | 35-40             
  text-buffer.ts   |   82.38 |    78.02 |   96.66 |   82.38 | ...1797,1824,1874 
  ...er-actions.ts |   86.71 |    67.79 |     100 |   86.71 | ...07-608,809-811 
 src/ui/contexts   |   83.85 |    80.81 |   86.36 |   83.85 |                   
  AppContext.tsx   |      40 |      100 |       0 |      40 | 17-22             
  ...igContext.tsx |   81.81 |       50 |     100 |   81.81 | 15-16             
  ...ssContext.tsx |   84.82 |    84.67 |     100 |   84.82 | ...87-592,688-690 
  ...owContext.tsx |   89.28 |       80 |   66.66 |   89.28 | 34,47-48,60-62    
  ...onContext.tsx |    91.3 |      100 |     100 |    91.3 | 98-99,103-106     
  ...gsContext.tsx |   83.33 |       50 |     100 |   83.33 | 17-18             
  ...ngContext.tsx |   71.42 |       50 |     100 |   71.42 | 17-20             
  ...nsContext.tsx |   86.66 |       50 |     100 |   86.66 | 53-54             
  ...teContext.tsx |      80 |       50 |     100 |      80 | 115-116           
  ...deContext.tsx |   67.39 |    28.57 |      50 |   67.39 | 47-48,52-59,75-80 
 src/ui/editors    |   93.18 |    85.71 |   66.66 |   93.18 |                   
  ...ngsManager.ts |   93.18 |    85.71 |   66.66 |   93.18 | 48,62-63          
 src/ui/hooks      |   79.22 |    80.92 |   83.52 |   79.22 |                   
  ...dProcessor.ts |   78.76 |    80.19 |     100 |   78.76 | ...47-450,461-479 
  ...dProcessor.ts |   96.34 |    76.31 |     100 |   96.34 | ...12-213,218-219 
  ...dProcessor.ts |   81.22 |    71.59 |   71.42 |   81.22 | ...97-401,462-490 
  ...Completion.ts |   92.77 |    89.28 |     100 |   92.77 | ...85-186,219-222 
  ...tIndicator.ts |     100 |      100 |     100 |     100 |                   
  ...ketedPaste.ts |    23.8 |      100 |       0 |    23.8 | 19-37             
  ...ompletion.tsx |    95.3 |       80 |     100 |    95.3 | ...24-225,227-228 
  useCompletion.ts |    92.4 |     87.5 |     100 |    92.4 | 68-69,93-94,98-99 
  ...leMessages.ts |   98.68 |       95 |     100 |   98.68 | 55                
  ...orSettings.ts |     100 |      100 |     100 |     100 |                   
  useFocus.ts      |     100 |      100 |     100 |     100 |                   
  ...olderTrust.ts |     100 |      100 |     100 |     100 |                   
  ...miniStream.ts |   71.24 |    70.34 |      75 |   71.24 | ...3-894,924-1026 
  ...BranchName.ts |   91.66 |    84.61 |     100 |   91.66 | 57-63             
  ...oryManager.ts |   98.41 |    93.33 |     100 |   98.41 | 43                
  ...stListener.ts |   84.37 |    58.33 |     100 |   84.37 | 19,24,41,43-44    
  ...putHistory.ts |    92.5 |    85.71 |     100 |    92.5 | 62-63,71,93-95    
  ...storyStore.ts |     100 |    94.11 |     100 |     100 | 66                
  useKeypress.ts   |     100 |      100 |     100 |     100 |                   
  ...rdProtocol.ts |   36.36 |      100 |       0 |   36.36 | 24-31             
  ...gIndicator.ts |     100 |      100 |     100 |     100 |                   
  useLogger.ts     |      25 |      100 |       0 |      25 | 15-33             
  ...ssageQueue.ts |     100 |      100 |     100 |     100 |                   
  ...raseCycler.ts |    95.6 |       80 |     100 |    95.6 | ...73-174,190-192 
  ...cySettings.ts |   86.48 |    78.94 |     100 |   86.48 | ...08-109,120-131 
  ...Completion.ts |   40.64 |    56.52 |     100 |   40.64 | ...23-224,226-227 
  ...ndFallback.ts |   98.27 |    96.42 |     100 |   98.27 | 69-71             
  ...lScheduler.ts |   78.92 |    94.44 |     100 |   78.92 | ...01-204,291-301 
  ...oryCommand.ts |       0 |        0 |       0 |       0 | 1-7               
  ...ompletion.tsx |    94.2 |      100 |     100 |    94.2 | 78-81             
  ...ngsCommand.ts |   18.75 |      100 |       0 |   18.75 | 10-25             
  ...ellHistory.ts |   91.66 |    79.41 |     100 |   91.66 | ...69,117-118,128 
  ...oryCommand.ts |       0 |        0 |       0 |       0 | 1-76              
  ...Completion.ts |   77.56 |    79.77 |      90 |   77.56 | ...44-452,473-479 
  ...tateAndRef.ts |   13.63 |      100 |       0 |   13.63 | 16-36             
  ...rminalSize.ts |   18.18 |      100 |       0 |   18.18 | 12-32             
  ...emeCommand.ts |    7.89 |      100 |       0 |    7.89 | 24-101            
  useTimer.ts      |   88.09 |    85.71 |     100 |   88.09 | 44-45,51-53       
  ...eMigration.ts |   11.11 |      100 |       0 |   11.11 | 16-70             
  vim.ts           |   83.57 |     79.5 |     100 |   83.57 | ...38,742-750,759 
 src/ui/privacy    |   14.52 |      100 |       0 |   14.52 |                   
  ...acyNotice.tsx |   10.38 |      100 |       0 |   10.38 | 21-117            
  ...acyNotice.tsx |   14.28 |      100 |       0 |   14.28 | 16-59             
  ...acyNotice.tsx |   12.19 |      100 |       0 |   12.19 | 16-62             
  ...acyNotice.tsx |   30.76 |      100 |       0 |   30.76 | 19-36,39-41       
 src/ui/themes     |   98.99 |    67.74 |     100 |   98.99 |                   
  ansi-light.ts    |     100 |      100 |     100 |     100 |                   
  ansi.ts          |     100 |      100 |     100 |     100 |                   
  atom-one-dark.ts |     100 |      100 |     100 |     100 |                   
  ayu-light.ts     |     100 |      100 |     100 |     100 |                   
  ayu.ts           |     100 |      100 |     100 |     100 |                   
  color-utils.ts   |     100 |      100 |     100 |     100 |                   
  default-light.ts |     100 |      100 |     100 |     100 |                   
  default.ts       |     100 |      100 |     100 |     100 |                   
  dracula.ts       |     100 |      100 |     100 |     100 |                   
  github-dark.ts   |     100 |      100 |     100 |     100 |                   
  github-light.ts  |     100 |      100 |     100 |     100 |                   
  googlecode.ts    |     100 |      100 |     100 |     100 |                   
  no-color.ts      |     100 |      100 |     100 |     100 |                   
  ...tic-tokens.ts |     100 |      100 |     100 |     100 |                   
  ...-of-purple.ts |     100 |      100 |     100 |     100 |                   
  theme-manager.ts |    87.5 |    78.33 |     100 |    87.5 | ...86-292,297-298 
  theme.ts         |     100 |    42.55 |     100 |     100 | 255-270           
  xcode.ts         |     100 |      100 |     100 |     100 |                   
 src/ui/utils      |   66.03 |    80.36 |   79.36 |   66.03 |                   
  ...Colorizer.tsx |   79.31 |    81.25 |     100 |   79.31 | ...51-154,190-216 
  ...olePatcher.ts |      78 |    77.77 |     100 |      78 | 58-69             
  ...nRenderer.tsx |   52.85 |    27.27 |     100 |   52.85 | ...26-132,142-144 
  ...wnDisplay.tsx |   85.88 |    87.69 |     100 |   85.88 | ...73-281,314-337 
  ...eRenderer.tsx |   78.09 |    76.19 |     100 |   78.09 | 55-83             
  ...boardUtils.ts |   32.25 |     37.5 |     100 |   32.25 | ...55-114,129-145 
  commandUtils.ts  |   92.79 |    88.37 |     100 |   92.79 | ...12,116,118-119 
  computeStats.ts  |     100 |      100 |     100 |     100 |                   
  displayUtils.ts  |     100 |      100 |     100 |     100 |                   
  formatters.ts    |   90.47 |    95.83 |     100 |   90.47 | 57-60             
  highlight.ts     |     100 |      100 |     100 |     100 |                   
  isNarrowWidth.ts |     100 |      100 |     100 |     100 |                   
  ...olDetector.ts |    7.89 |      100 |       0 |    7.89 | ...11-112,115-116 
  ...nUtilities.ts |   69.84 |    85.71 |     100 |   69.84 | 75-91,100-101     
  ...mConstants.ts |     100 |      100 |     100 |     100 |                   
  terminalSetup.ts |       4 |      100 |       0 |       4 | 40-342            
  textUtils.ts     |   94.11 |    82.35 |     100 |   94.11 | 17-18             
  updateCheck.ts   |     100 |    80.95 |     100 |     100 | 27-39             
 src/utils         |   49.51 |       91 |   89.39 |   49.51 |                   
  checks.ts        |   33.33 |      100 |       0 |   33.33 | 23-28             
  cleanup.ts       |   65.38 |      100 |   66.66 |   65.38 | 28-37             
  deepMerge.ts     |     100 |    89.28 |     100 |     100 | 41-43,49          
  ...ScopeUtils.ts |   79.06 |    66.66 |     100 |   79.06 | 55-64             
  ...arResolver.ts |   96.42 |       96 |     100 |   96.42 | 111-112           
  errors.ts        |   71.42 |       50 |     100 |   71.42 | 11-12             
  events.ts        |     100 |      100 |     100 |     100 |                   
  gitUtils.ts      |   94.66 |    82.35 |     100 |   94.66 | 75-78             
  ...AutoUpdate.ts |    51.2 |       95 |      50 |    51.2 | 84-149            
  ...lationInfo.ts |     100 |      100 |     100 |     100 |                   
  package.ts       |   88.88 |       80 |     100 |   88.88 | 33-34             
  readStdin.ts     |   79.24 |       90 |      80 |   79.24 | 31-38,50-52       
  resolvePath.ts   |   66.66 |       25 |     100 |   66.66 | 12-13,16,18-19    
  sandbox.ts       |       0 |        0 |       0 |       0 | 1-947             
  settingsUtils.ts |   87.33 |    94.73 |   96.87 |   87.33 | ...91-418,462-463 
  spawnWrapper.ts  |     100 |      100 |     100 |     100 |                   
  ...upWarnings.ts |   53.84 |    33.33 |     100 |   53.84 | 17-26,38-39       
  ...entEmitter.ts |     100 |      100 |     100 |     100 |                   
  ...upWarnings.ts |     100 |      100 |     100 |     100 |                   
  version.ts       |     100 |       50 |     100 |     100 | 11                
 ...ed-integration |   25.12 |        0 |       0 |   25.12 |                   
  acp.ts           |    3.29 |        0 |       0 |    3.29 | ...53-289,292-339 
  ...temService.ts |   19.35 |      100 |       0 |   19.35 | 15-19,22-34,37-46 
  schema.ts        |     100 |      100 |     100 |     100 |                   
  ...ntegration.ts |    3.15 |        0 |       0 |    3.15 | ...18-865,880-930 
-------------------|---------|----------|---------|---------|-------------------
Core Package - Full Text Report
-------------------|---------|----------|---------|---------|-------------------
File               | % Stmts | % Branch | % Funcs | % Lines | Uncovered Line #s 
-------------------|---------|----------|---------|---------|-------------------
All files          |   78.89 |    84.56 |   78.21 |   78.89 |                   
 src               |     100 |      100 |     100 |     100 |                   
  index.ts         |     100 |      100 |     100 |     100 |                   
 src/__mocks__/fs  |     100 |      100 |     100 |     100 |                   
  promises.ts      |     100 |      100 |     100 |     100 |                   
 src/code_assist   |   78.26 |     83.6 |   80.85 |   78.26 |                   
  codeAssist.ts    |    17.5 |      100 |       0 |    17.5 | 16-38,41-54       
  converter.ts     |    96.2 |    95.55 |     100 |    96.2 | 179-183,197       
  ...al-storage.ts |     100 |    74.07 |     100 |     100 | 39-41,72-75       
  oauth2.ts        |   80.94 |    82.43 |   91.66 |   80.94 | ...87-488,511-512 
  server.ts        |   54.21 |    73.33 |   57.14 |   54.21 | ...30-233,252-253 
  setup.ts         |   86.66 |    78.94 |     100 |   86.66 | ...,92-94,118-124 
  types.ts         |     100 |      100 |     100 |     100 |                   
 src/config        |    79.6 |    91.04 |   53.84 |    79.6 |                   
  config.ts        |   78.69 |    90.43 |   48.38 |   78.69 | ...59-860,924-925 
  models.ts        |     100 |      100 |     100 |     100 |                   
  storage.ts       |   84.44 |    94.73 |      75 |   84.44 | ...14-115,118-119 
 src/core          |    80.3 |    84.93 |   75.35 |    80.3 |                   
  client.ts        |   80.26 |    84.66 |   71.42 |   80.26 | ...97-805,872-880 
  ...tGenerator.ts |    92.7 |    79.16 |     100 |    92.7 | 83-84,135,155-158 
  ...lScheduler.ts |   83.23 |       83 |   88.46 |   83.23 | ...1023,1103-1107 
  geminiChat.ts    |    80.3 |    86.61 |   64.51 |    80.3 | ...66-867,870-871 
  geminiRequest.ts |     100 |      100 |     100 |     100 |                   
  logger.ts        |   82.94 |    81.81 |     100 |   82.94 | ...44-348,388-399 
  ...tGenerator.ts |   13.51 |      100 |      10 |   13.51 | ...87-188,191-194 
  ...olExecutor.ts |     100 |      100 |      50 |     100 |                   
  prompts.ts       |   97.77 |      100 |   66.66 |   97.77 | 347-404           
  subagent.ts      |   86.81 |    81.31 |     100 |   86.81 | ...54-662,705-706 
  tokenLimits.ts   |   14.28 |      100 |       0 |   14.28 | 15-32             
  turn.ts          |   87.86 |    88.46 |     100 |   87.86 | ...20,333-334,381 
 src/fallback      |     100 |      100 |     100 |     100 |                   
  handler.ts       |     100 |      100 |     100 |     100 |                   
  types.ts         |     100 |      100 |     100 |     100 |                   
 src/generated     |     100 |      100 |     100 |     100 |                   
  git-commit.ts    |     100 |      100 |     100 |     100 |                   
 src/ide           |   73.97 |    81.81 |      76 |   73.97 |                   
  constants.ts     |     100 |      100 |     100 |     100 |                   
  detect-ide.ts    |   96.96 |    97.29 |     100 |   96.96 | 63-65             
  ide-client.ts    |   58.55 |    71.69 |      60 |   58.55 | ...16-524,549-557 
  ide-installer.ts |   91.45 |       84 |     100 |   91.45 | ...95,130-134,147 
  ideContext.ts    |    83.8 |      100 |     100 |    83.8 | 75-91             
  process-utils.ts |   87.09 |    74.19 |     100 |   87.09 | ...25,156,166-167 
 src/mcp           |   76.59 |    69.13 |   80.48 |   76.59 |                   
  ...h-provider.ts |   86.36 |      100 |   33.33 |   86.36 | ...85,89,93,97-98 
  ...h-provider.ts |   74.52 |    52.17 |     100 |   74.52 | ...03-807,814-816 
  ...en-storage.ts |   97.39 |      100 |      90 |   97.39 | 67-69             
  oauth-utils.ts   |   67.78 |    78.57 |    90.9 |   67.78 | ...61-282,307-330 
 .../token-storage |   88.18 |    86.46 |      95 |   88.18 |                   
  ...en-storage.ts |     100 |      100 |     100 |     100 |                   
  ...en-storage.ts |   82.75 |    82.35 |   92.85 |   82.75 | ...62-172,180-181 
  ...en-storage.ts |     100 |      100 |     100 |     100 |                   
  ...en-storage.ts |   85.71 |    81.81 |      90 |   85.71 | ...25-227,249-250 
  types.ts         |     100 |      100 |     100 |     100 |                   
 src/mocks         |     100 |      100 |     100 |     100 |                   
  msw.ts           |     100 |      100 |     100 |     100 |                   
 src/prompts       |   26.41 |      100 |      25 |   26.41 |                   
  mcp-prompts.ts   |   18.18 |      100 |       0 |   18.18 | 11-19             
  ...t-registry.ts |   28.57 |      100 |   28.57 |   28.57 | ...42,48-55,68-73 
 src/services      |   90.03 |    83.67 |   91.66 |   90.03 |                   
  ...ingService.ts |   85.27 |    57.14 |     100 |   85.27 | ...13-415,442-444 
  ...eryService.ts |   93.33 |    88.46 |   85.71 |   93.33 | 32,40,85,110-111  
  ...temService.ts |     100 |      100 |     100 |     100 |                   
  gitService.ts    |      70 |    93.33 |      60 |      70 | ...15-125,128-132 
  ...ionService.ts |   97.53 |     92.1 |     100 |   97.53 | ...44-345,351-352 
  ...ionService.ts |   92.21 |    88.15 |     100 |   92.21 | ...68-369,446-460 
 src/telemetry     |   67.28 |    82.48 |   69.14 |   67.28 |                   
  constants.ts     |     100 |      100 |     100 |     100 |                   
  ...-exporters.ts |   26.47 |      100 |       0 |   26.47 | ...84,87-88,91-92 
  index.ts         |     100 |      100 |     100 |     100 |                   
  ...t.circular.ts |       0 |        0 |       0 |       0 | 1-63              
  ...t.circular.ts |       0 |        0 |       0 |       0 | 1-128             
  loggers.ts       |   63.15 |    72.72 |   59.09 |   63.15 | ...98-617,620-639 
  metrics.ts       |   63.72 |     93.1 |      50 |   63.72 | ...59-261,267-269 
  sdk.ts           |   78.37 |    44.44 |     100 |   78.37 | ...83,188-189,191 
  ...etry-utils.ts |     100 |      100 |     100 |     100 |                   
  ...l-decision.ts |     100 |      100 |     100 |     100 |                   
  types.ts         |   80.61 |     87.5 |   82.85 |   80.61 | ...06-417,420-431 
  uiTelemetry.ts   |    99.3 |    95.83 |     100 |    99.3 | 130               
 ...learcut-logger |   74.01 |    78.66 |    64.7 |   74.01 |                   
  ...cut-logger.ts |   70.98 |    78.37 |    64.7 |   70.98 | ...48-949,952-955 
  ...tadata-key.ts |     100 |      100 |     100 |     100 |                   
 src/test-utils    |   76.01 |     92.3 |   66.66 |   76.01 |                   
  config.ts        |     100 |      100 |     100 |     100 |                   
  index.ts         |     100 |      100 |     100 |     100 |                   
  mock-tool.ts     |   18.64 |        0 |       0 |   18.64 | ...8,61-62,72-115 
  ...aceContext.ts |     100 |      100 |     100 |     100 |                   
  tools.ts         |   95.86 |    95.23 |      80 |   95.86 | 60-61,129,133-134 
 src/tools         |   74.22 |    81.91 |   81.92 |   74.22 |                   
  diffOptions.ts   |     100 |      100 |     100 |     100 |                   
  edit.ts          |   83.02 |    88.37 |   85.71 |   83.02 | ...46-447,537-577 
  glob.ts          |    91.6 |    82.69 |    87.5 |    91.6 | ...26-227,318-319 
  grep.ts          |   59.25 |    82.14 |      80 |   59.25 | ...04-608,618-619 
  ls.ts            |   97.37 |    91.66 |     100 |   97.37 | 140-145           
  ...nt-manager.ts |   79.04 |    66.66 |      80 |   79.04 | ...31-138,146-147 
  mcp-client.ts    |    31.6 |    78.02 |      50 |    31.6 | ...1367,1371-1374 
  mcp-tool.ts      |    94.9 |    93.05 |   94.11 |    94.9 | 200-210,272-273   
  memoryTool.ts    |   83.15 |    82.97 |   88.88 |   83.15 | ...31-246,357-375 
  ...iable-tool.ts |     100 |    84.61 |     100 |     100 | 99,106            
  read-file.ts     |    98.7 |    97.14 |    87.5 |    98.7 | 63-64             
  ...many-files.ts |   79.91 |    78.37 |   85.71 |   79.91 | ...83-484,491-492 
  ripGrep.ts       |   89.94 |    86.73 |   93.33 |   89.94 | ...52-453,474-475 
  shell.ts         |   85.54 |    77.33 |    90.9 |   85.54 | ...94-395,406-407 
  smart-edit.ts    |   78.93 |    75.24 |   85.71 |   78.93 | ...88-790,802-845 
  tool-error.ts    |     100 |      100 |     100 |     100 |                   
  tool-registry.ts |    72.8 |    66.07 |   77.77 |    72.8 | ...08-410,437-438 
  tools.ts         |   86.34 |    89.13 |      75 |   86.34 | ...68-369,385-391 
  web-fetch.ts     |    60.3 |    54.05 |    90.9 |    60.3 | ...45-346,353-354 
  web-search.ts    |     100 |     93.1 |     100 |     100 | 106-107           
  write-file.ts    |   82.08 |    79.36 |      75 |   82.08 | ...49-452,464-500 
 src/utils         |    86.2 |    88.12 |   88.61 |    86.2 |                   
  LruCache.ts      |   70.96 |     62.5 |     100 |   70.96 | 20-22,28,30-34    
  bfsFileSearch.ts |   89.02 |    90.47 |     100 |   89.02 | 86-94             
  browser.ts       |    7.69 |      100 |       0 |    7.69 | 17-56             
  editCorrector.ts |    77.3 |    61.11 |   91.66 |    77.3 | ...66-678,712,726 
  editor.ts        |   97.63 |    94.54 |     100 |   97.63 | 154,224,227-228   
  ...entContext.ts |     100 |      100 |     100 |     100 |                   
  errorParsing.ts  |     100 |     92.3 |     100 |     100 | 76,80,86          
  ...rReporting.ts |   83.72 |    84.61 |     100 |   83.72 | 82-86,107-115     
  errors.ts        |    55.4 |    71.42 |      50 |    55.4 | ...0,76-92,96-102 
  fetch.ts         |   34.04 |      100 |       0 |   34.04 | 22-27,31-57       
  fileUtils.ts     |   95.32 |    90.37 |     100 |   95.32 | ...36-240,450-456 
  formatters.ts    |   54.54 |       50 |     100 |   54.54 | 12-16             
  ...eUtilities.ts |    95.4 |    94.87 |     100 |    95.4 | 16-17,45-46       
  ...rStructure.ts |   95.96 |    94.93 |     100 |   95.96 | ...14-117,345-347 
  getPty.ts        |    12.5 |      100 |       0 |    12.5 | 21-34             
  ...noreParser.ts |    93.6 |    88.23 |     100 |    93.6 | ...75-176,180-181 
  gitUtils.ts      |   51.21 |     90.9 |      50 |   51.21 | 40-41,50-73       
  ide-trust.ts     |      60 |      100 |       0 |      60 | 14-15             
  ...rePatterns.ts |     100 |      100 |     100 |     100 |                   
  ...ionManager.ts |     100 |       90 |     100 |     100 | 23                
  ...-detection.ts |     100 |      100 |     100 |     100 |                   
  ...edit-fixer.ts |   33.87 |      100 |       0 |   33.87 | 100-141,144-145   
  ...yDiscovery.ts |   85.75 |    75.43 |   77.77 |   85.75 | ...87-388,391-392 
  ...tProcessor.ts |   91.51 |    88.46 |   84.61 |   91.51 | ...02-308,385-386 
  ...Inspectors.ts |     100 |      100 |     100 |     100 |                   
  ...kerChecker.ts |   83.33 |    83.33 |     100 |   83.33 | 64-65,75-80,88-94 
  partUtils.ts     |     100 |      100 |     100 |     100 |                   
  pathReader.ts    |     100 |      100 |     100 |     100 |                   
  paths.ts         |   86.13 |    87.87 |     100 |   86.13 | ...,89-90,101-102 
  ...rDetection.ts |    64.4 |    76.19 |     100 |    64.4 | ...4,88-89,99-100 
  retry.ts         |   62.55 |    73.21 |     100 |   62.55 | ...58-278,323-338 
  ...nStringify.ts |     100 |      100 |     100 |     100 |                   
  ...aValidator.ts |    82.6 |       50 |     100 |    82.6 | 27-28,30-31       
  ...r-launcher.ts |   76.52 |     87.5 |   66.66 |   76.52 | ...33,135,153-191 
  session.ts       |     100 |      100 |     100 |     100 |                   
  shell-utils.ts   |   96.08 |    93.96 |     100 |   96.08 | ...94-195,242-244 
  summarizer.ts    |     100 |    88.88 |     100 |     100 | 91                
  ...emEncoding.ts |      98 |    94.11 |     100 |      98 | 106-107           
  testUtils.ts     |   84.44 |    72.72 |   83.33 |   84.44 | 27-28,34-35,70-72 
  textUtils.ts     |    12.5 |      100 |       0 |    12.5 | 15-34             
  tool-utils.ts    |   91.48 |    89.47 |     100 |   91.48 | 53-54,57-58       
  ...untManager.ts |   97.14 |    94.59 |     100 |   97.14 | 36-38             
  ...aceContext.ts |   96.82 |    95.12 |    92.3 |   96.82 | 94-95,109-110     
 ...ils/filesearch |   96.17 |     91.4 |     100 |   96.17 |                   
  crawlCache.ts    |     100 |      100 |     100 |     100 |                   
  crawler.ts       |   96.22 |     92.3 |     100 |   96.22 | 66-67             
  fileSearch.ts    |   93.22 |    87.14 |     100 |   93.22 | ...28-229,231-232 
  ignore.ts        |     100 |      100 |     100 |     100 |                   
  result-cache.ts  |     100 |     92.3 |     100 |     100 | 46                
-------------------|---------|----------|---------|---------|-------------------

For detailed HTML reports, please see the 'coverage-reports-22.x-ubuntu-latest' artifact from the main CI run.

Copy link
Collaborator

@anj-s anj-s left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let us wait for Matt to take a look as well.

@mattKorwel
Copy link
Collaborator

Sorry for some of the basic questions but:

  1. is 'poetry' a thing written by us or by swe bench?
  2. where does the docker container come from, where is its source?
  3. when it says we spin up 10 concurrent agents, is that spinning up 10 docker containers or 10 instances of something in cloud run or else where?
  4. where do we see the logs for these runs
  5. regarding the continuous scorer: Outside of this workflow, we run a continuous scorer polling GCS for these diffs where is the code for this. Can the scorer just run in this action to score it right here?
  6. how do we see the the results of the evals?

@mattKorwel mattKorwel self-requested a review September 9, 2025 02:51
Copy link
Collaborator

@mattKorwel mattKorwel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment below, I think we need to make this clearer in how its used and what it does and where the code comes from and how we see the results.

@agmsb
Copy link
Contributor Author

agmsb commented Sep 9, 2025

Sorry for some of the basic questions but:

  1. is 'poetry' a thing written by us or by swe bench?
  2. where does the docker container come from, where is its source?
  3. when it says we spin up 10 concurrent agents, is that spinning up 10 docker containers or 10 instances of something in cloud run or else where?
  4. where do we see the logs for these runs
  5. regarding the continuous scorer: Outside of this workflow, we run a continuous scorer polling GCS for these diffs where is the code for this. Can the scorer just run in this action to score it right here?
  6. how do we see the the results of the evals?

Not a problem addressing them one by one below.

is 'poetry' a thing written by us or by swe bench?

The eval harness is written in Python and uses Poetry for dependency/package management.

when it says we spin up 10 concurrent agents, is that spinning up 10 docker containers or 10 instances of something in cloud run or else where?

The harness will spawn 10 concurrent async.io Tasks, but the harness itself will still run in a single container. Each Task will run Gemini CLI with npx, setting up the working directory and passing in the prompt. This will all run on the GitHub Action runner.

regarding the continuous scorer: Outside of this workflow, we run a continuous scorer polling GCS for these diffs where is the code for this. Can the scorer just run in this action to score it right here?

Yes, I think we'll want to do follow up with this quickly after this PR.

where do we see the logs for these runs
how do we see the the results of the evals?

Currently this action in this PR writes out the resolve/unresolved count to stdout. For run-level summaries in a follow-up PR we will log them to stdout so it can be viewable in the GitHub Action UI. For per-eval instance logs and agent trajectories, in a follow-up PR we will upload these to Google Cloud Storage and write out to stdout a link to the GCS bucket.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/models Support for Gemini models, multi-modality, local execution, and performance tuning. area/quality Focuses on testing, reliability, performance, and overall product quality. maintainer For Roadmap Items
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants