
how to interpret results for Auto Code Rover SWE-bench? #47

Open
ramsey-coding opened this issue May 13, 2024 · 2 comments

Comments

@ramsey-coding

I am trying to understand results for Auto Code Rover and SWE-Agent.

Can you please let me know the format of the SWE-Agent test results in:
https://github.com/nus-apr/auto-code-rover/tree/main/results/swe-agent-results

What are all these cost_2_1, cost_2_2, and cost_2_3?

How should I interpret the results in this directory?

Also, for Auto Code Rover, I see acr-run-1, acr-run-2, and acr-run-3. Which one should I take? Which result are you reporting in the paper?

@ramsey-coding
Author

what's the difference between the following fields?

        "generated": 249,
        "with_logs": 249,
        "applied": 245,
        "resolved": 48

@zhiyufan
Collaborator

cost_X_Y: X is the cost budget (in USD) for running SWE-agent in our experiment, and Y is the repetition trial.
In this case, we used a budget of 2 USD and repeated the experiment 3 times.
Inside each cost_X_Y directory, the *.traj files are the conversation logs for each task instance in SWE-bench.
all_pred.jsonl contains all the generated patches.
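To load the patches from all_pred.jsonl, you can read it as one JSON object per line. A minimal sketch, assuming the common SWE-bench prediction fields ("instance_id", "model_patch") — the exact field names are an assumption, not confirmed in this thread:

```python
import json

def load_predictions(path):
    """Read a JSON-lines predictions file into {instance_id: patch}.

    Field names ("instance_id", "model_patch") are assumed to follow
    the usual SWE-bench prediction format; adjust if your file differs.
    """
    preds = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            rec = json.loads(line)
            preds[rec["instance_id"]] = rec.get("model_patch")
    return preds
```

Each trial directory would then yield one dictionary of generated patches, which you can compare across the three repetitions.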

For AutoCodeRover, the acr-run-1, acr-run-2, and acr-run-3 results correspond to Table 3 in the paper (the "In our environment", ACR column).

  • generated: an agent-generated patch exists for this issue.
  • with_logs: a log file was produced when executing the passing/failing test cases for this issue.
  • applied: the patch applies cleanly to the original program.
  • resolved: the patch makes the passing/failing test cases for this issue pass.

The details on how the stats are generated can be found here: https://github.com/yuntongzhang/SWE-bench/blob/main/metrics/report.py#L264C5-L264C21
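The four counters form a funnel (a patch must be generated before it can be applied, and applied before it can resolve the issue). A minimal sketch of how such counts could be tallied from per-instance records — field names here are illustrative, not the actual schema used by report.py in the linked repository:

```python
def tally(records):
    """Tally the four report counters from per-instance records.

    Each record is a dict with illustrative (assumed) boolean-ish fields:
      patch          -- the generated patch text, or None
      log_exists     -- a test-execution log file was produced
      patch_applied  -- the patch applied cleanly to the original program
      tests_pass     -- the passing/failing test cases now pass
    """
    stats = {"generated": 0, "with_logs": 0, "applied": 0, "resolved": 0}
    for r in records:
        if r.get("patch"):
            stats["generated"] += 1
        if r.get("log_exists"):
            stats["with_logs"] += 1
        if r.get("patch_applied"):
            stats["applied"] += 1
        if r.get("tests_pass"):
            stats["resolved"] += 1
    return stats
```

In the numbers quoted above (249 generated, 245 applied, 48 resolved), the drop at each stage shows patches that failed to apply or failed the tests.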
