This repo contains the evaluation code for the paper "SciCode: A Research Coding Benchmark Curated by Scientists"
SciCode is a challenging benchmark designed to evaluate the capabilities of language models (LMs) in generating code to solve realistic scientific research problems. It has diverse coverage of 16 subdomains across domains spanning Physics, Math, Material Science, Biology, and Chemistry. Unlike previous benchmarks that consist of exam-like question–answer pairs, SciCode is converted from real research problems. SciCode problems naturally factorize into multiple subproblems, each involving knowledge recall, reasoning, and code synthesis. In total, SciCode contains 338 subproblems decomposed from 80 challenging main problems, and it offers optional descriptions specifying useful scientific background information, along with scientist-annotated gold-standard solutions and test cases for evaluation. Claude3.5-Sonnet, the best-performing model among those tested, solves only 4.6% of the problems in the most realistic setting. Broadly, SciCode reflects scientists' everyday workflow of identifying critical scientific concepts and facts and transforming them into computation and simulation code. We believe SciCode not only helps measure contemporary LMs' progress toward becoming helpful assistants for scientists but also sheds light on the future building and evaluation of scientific AI.
SciCode sources challenging, realistic research-level coding problems across the natural sciences, covering a total of 16 subfields. SciCode mainly focuses on (1) numerical methods, (2) simulation of systems, and (3) scientific calculation. These are tasks we believe require intense scientific knowledge and reasoning to optimally test an LM's science capability.
| Model | Subproblem (%) | Main Problem (%) |
|---|---|---|
| Claude3.5-Sonnet | 26 | 4.6 |
| GPT-4o | 25 | 1.5 |
| GPT-4-Turbo | 22.9 | 1.5 |
| Gemini 1.5 Pro | 21.9 | 1.5 |
| Claude3-Opus | 21.5 | 1.5 |
| Deepseek-Coder-v2 | 21.2 | 3.1 |
| Claude3-Sonnet | 17 | 1.5 |
| Qwen2-72B-Instruct | 17 | 1.5 |
| Mixtral-8x22B-Instruct | 16.3 | 0 |
| Llama-3-70B-Chat | 14.6 | 0 |
- Clone this repository: `git clone [email protected]:scicode-bench/SciCode.git`
- Install the `scicode` package with `pip install -e .`
- Download the numeric test results and save them as `./eval/data/test_data.h5`
- Run `eval/scripts/gencode_json.py` to generate new model outputs (see the `eval/scripts` readme for more information)
- Run `eval/scripts/test_generated_code.py` to evaluate the generated code against the unit tests
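The steps above can be sketched as a single shell session. This is a minimal sketch assuming a working Python environment with `pip` and SSH access to GitHub; the source of the `test_data.h5` download is not specified here, so that step is a placeholder path you must replace:

```shell
# Clone the repository and install the scicode package in editable mode
git clone [email protected]:scicode-bench/SciCode.git
cd SciCode
pip install -e .

# Place the downloaded numeric test results at the expected path
# (replace /path/to/downloaded/test_data.h5 with your actual download location)
mkdir -p eval/data
cp /path/to/downloaded/test_data.h5 eval/data/test_data.h5

# Generate model outputs, then evaluate them against the unit tests
python eval/scripts/gencode_json.py
python eval/scripts/test_generated_code.py
```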
- Minyang Tian: [email protected]
- Eliu Huerta: [email protected]
- Hao Peng: [email protected]