Official Code for FixEval: Execution-based Evaluation of Program Fixes for Competitive Programming Problems
Source code repositories consist of large codebases, often containing error-prone programs. The increasing complexity of software has led to a drastic rise in the time and cost of identifying and fixing these defects. Various methods exist to automatically generate fixes for buggy code. However, due to the large combinatorial space of possible solutions for a particular bug, there are few tools and datasets available to evaluate generated code effectively. In this work, we introduce FixEval, a benchmark comprising buggy code submissions to competitive programming problems and their respective fixes. We introduce a rich test suite to evaluate and assess the correctness of model-generated program fixes. We consider two Transformer language models pretrained on programming languages as our baselines and compare them using match-based and execution-based evaluation metrics. Our experiments show that match-based metrics do not reflect model-generated program fixes accurately, while execution-based methods evaluate programs through all cases and scenarios specifically designed for that solution. Therefore, we believe FixEval provides a step towards real-world automatic bug fixing and model-generated code evaluation.
- Table of Contents
- Folder Structure
- Dataset
- Installation
- Pre-processing
- Download Preprocessed Data
- Training and Evaluation
- Benchmarks
- License
- Citation
├── codet5
│ ├── run.sh
│ ├── configs.py
│ ├── models.py
│ ├── run_gen.py
│ └── ...
│
├── plbart
│ ├── run.sh
│ ├── configs.py
│ ├── models.py
│ ├── run_gen.py
│ └── ...
│
├── data
│ ├── java
│ │ ├──jsons
│ │ ├──processed
│ ├── python
│ │ ├──jsons
│ │ ├──processed
│ ├── atcoder_test_cases
│ └── processed.json
│
├── third_party
│ ├── apex
│ ├── fairseq
│ ├── tree-sitter-cpp
│ ├── tree-sitter-java
│ └── tree-sitter-python
│
├── evaluation
│ ├── CodeBLEU
│ ├── codegen
│ ├── bleu.py
│ ├── compile.py
│ ├── compute_ca.py
│ ├── evaluator.py
│ ├── execution_evaluation_TC_arc_MP.py
│ └── ...
│
└── src
├── 01_preprocessing.ipynb
├── make_submission_list_json.py
├── process_json.py
├── deduplication.py
├── generate_eval_files.py
├── merge.py
├── split.py
└── ...
All data for reproducing the results is available here:
https://drive.google.com/drive/folders/1dzuHuouuWzlFCy1CMj9DYG9JGraEay27?usp=sharing
Run the following commands in the root folder.
Run these commands to download the full CodeNet dataset (an archive of around 8 GB) into the root directory and decompress it:
wget https://dax-cdn.cdn.appdomain.cloud/dax-project-codenet/1.0.0/Project_CodeNet.tar.gz
tar -xf Project_CodeNet.tar.gz
Run these commands to download the CodeNet metadata (a 281 MB archive) into the root directory and decompress it:
wget https://dax-cdn.cdn.appdomain.cloud/dax-project-codenet/1.0.0/Project_CodeNet_metadata.tar.gz
tar -xf Project_CodeNet_metadata.tar.gz
Create the data folder to store the test cases along with the Java and Python data files, then download the AtCoder test cases into it:
mkdir -p data
cd data
wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1AInTHzaZqym7WsT1B7yc8nZy7dA3ovPf' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1AInTHzaZqym7WsT1B7yc8nZy7dA3ovPf" -O atcoder_test_cases.zip && rm -rf /tmp/cookies.txt
unzip atcoder_test_cases.zip
cd ../
The preferred installation method is to run this command (You may need to change the bash file to update the environment names, etc.):
bash install_env.sh
Another method is to run the following (You may need to manually add some libraries):
conda env create -n python36 -f src/environment.yml
conda activate python36
All the commands below assume that you installed everything in this environment correctly and activated the environment.
src/make_submission_list_json.py parses the problem submission information, the problem list CSV, and the folder of actual submission files to create an initial JSON file, processed.json, which uses the following format:

- processed is a dictionary keyed by user_id; processed.keys() lists every user.
- processed['user_id'] is keyed by the problem_id's solved by that user.
- processed['user_id']['problem_id'] contains a list of tuples, one per submission: (submission_id, date, language, original_language, filename_ext, status).
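For illustration, the layout of processed.json looks roughly like the following. This is a hypothetical example with made-up values; the actual field values come from the CodeNet metadata.

```python
# Hypothetical illustration of the processed.json layout (all values are made up).
# Each inner list is one submission:
# [submission_id, date, language, original_language, filename_ext, status]
processed = {
    "u123456789": {          # user_id
        "p00001": [          # problem_id
            ["s987654321", "2019-05-12", "Python", "Python (3.4.3)", "py", "Wrong Answer"],
            ["s987654322", "2019-05-13", "Python", "Python (3.4.3)", "py", "Accepted"],
        ],
    },
}
```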
To create this, use the following script (you may need to change the path information):
cd src
python make_submission_list_json.py
cd ../
If any file is missing, such as "my_languages.so", please check the folder linked below; if it is not there, please create an issue and I will make it available as soon as possible.
https://drive.google.com/drive/folders/1dzuHuouuWzlFCy1CMj9DYG9JGraEay27?usp=sharing
We use the processed.json file to create the training data chunk by chunk (10k datapoints per file) and store the chunks in the data folder for each programming language. The following script preprocesses both the Java and Python data and stores it in JSON format under data/{language}/jsons/.
cd src
python process_json.py
cd ../
Alternatively, you can download the processed.json file, which is the root file for all data generation and processing:
cd data/
wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1gxZYObARqJytI9gf6gEX-CZhCpc4JPE6' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1gxZYObARqJytI9gf6gEX-CZhCpc4JPE6" -O processed.zip && rm -rf /tmp/cookies.txt
unzip processed.zip
cd ../
split.py merges all the JSON chunks, deduplicates them using a Jaccard similarity function, and splits the data into train-valid-test sets with an 80-10-10 ratio. The split is done at the problem level so that no datapoints for a single problem appear in multiple splits (e.g., both train and test). During the split, we also maintain the condition that every datapoint in the valid and test sets has its test cases available, so that execution-based evaluation can be performed on both. A sketch of this logic is shown after the commands below.
cd src
python split.py
python split.py --lang py --src_file ../data/Python/jsons/ --src_dir ../data/Python/processed/ --out_dir ../data/Python/processed/
cd ../
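For reference, the following is a minimal sketch of the deduplication-and-split logic described above, not the actual split.py. It assumes each datapoint is a dict with hypothetical "problem_id" and "code" fields, and it omits the constraint that valid/test problems must have test cases available.

```python
import random


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two code strings."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / max(len(ta | tb), 1)


def deduplicate(datapoints, threshold=0.9):
    """Keep a datapoint only if its code is not a near-duplicate of one already kept."""
    kept = []
    for dp in datapoints:
        if all(jaccard(dp["code"], k["code"]) < threshold for k in kept):
            kept.append(dp)
    return kept


def split_by_problem(datapoints, seed=42):
    """Split problems 80-10-10 so no problem contributes datapoints to more than one split."""
    by_problem = {}
    for dp in datapoints:
        by_problem.setdefault(dp["problem_id"], []).append(dp)
    problems = sorted(by_problem)
    random.Random(seed).shuffle(problems)
    n = len(problems)
    train_p = problems[: int(0.8 * n)]
    valid_p = problems[int(0.8 * n): int(0.9 * n)]
    test_p = problems[int(0.9 * n):]

    def flatten(ps):
        return [dp for p in ps for dp in by_problem[p]]

    return flatten(train_p), flatten(valid_p), flatten(test_p)
```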
Run the following commands if you want to download the processed data and train:
wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1vsuUrJ2j86EYGb2WWQatqsqJ-V8Sl6en' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1vsuUrJ2j86EYGb2WWQatqsqJ-V8Sl6en" -O java.zip && rm -rf /tmp/cookies.txt
unzip java.zip
wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1rjjYW8SB8f5Hr34ig84OKpNYOzdt03Ar' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1rjjYW8SB8f5Hr34ig84OKpNYOzdt03Ar" -O python.zip && rm -rf /tmp/cookies.txt
unzip python.zip
After successful completion, we derive 4 datasets from this part:
- Java buggy code to Java fixed code (data/java/processed/)
- Java buggy code with verdict information to Java fixed code (data/java/processed_with_verdict/)
- Python buggy code to Python fixed code (data/python/processed/)
- Python buggy code with verdict information to Python fixed code (data/python/processed_with_verdict/)
Each of these 4 directories contains:
- {train, test, valid}.jsonl files containing all the information for the datapoints, which also allows us to always revert back to the original dataset
- {train, test, valid}.{language-language}.id files, where language is one of [java, python]
- 6 raw text files for training: {src, tgt}_{train, test, valid}.{language-language}.language
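For example, based on the naming pattern above, data/java/processed/ would contain files along these lines (an illustrative listing; the exact names in your copy may differ slightly):

├── data/java/processed
│   ├── train.jsonl
│   ├── valid.jsonl
│   ├── test.jsonl
│   ├── train.java-java.id
│   ├── valid.java-java.id
│   ├── test.java-java.id
│   ├── src_train.java-java.java
│   ├── tgt_train.java-java.java
│   ├── src_valid.java-java.java
│   ├── tgt_valid.java-java.java
│   ├── src_test.java-java.java
│   └── tgt_test.java-java.java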
To use our open sourced pretrained models and data files, download plbart.zip or codeT5.zip from the link below and verify the results using the same procedure.
https://drive.google.com/drive/folders/1dzuHuouuWzlFCy1CMj9DYG9JGraEay27?usp=sharing
Then go to that specific model folder and run the run.sh script. More instructions appear later on this page.
cd plbart/
./run.sh
To run the CodeT5 model, go to the codet5 folder and use the run.sh script. This will also evaluate the model on match-based metrics (BLEU, CodeBLEU, Syntax Match, Dataflow Match, etc.).
Some changes are required before executing the run.sh script:

- Change the source and target languages on lines 14-15 to one of ['java', 'python'].
- Change path_2_data at the end of line 22 to the folder containing the processed or processed_with_verdict data.
- Change line 27 so that the model and cached-data save directory is consistent with the data; for example, append "_with_verdict" if the associated data path contains "_with_verdict".
To run only the evaluation, comment out the train function at the bottom of the run.sh file.
Each run.sh file has a similar structure:

./run.sh GPU_ID SRC_LANGUAGE TARGET_LANGUAGE DATA_SOURCE WITH_VERDICT

- GPU_ID is the ID of the GPU you want to use. For a single GPU, input "0".
- SRC_LANGUAGE and TARGET_LANGUAGE are the same for a single run; they can be either "java" or "python".
- DATA_SOURCE is the location of the preprocessed data, e.g., "codenet" if the stored preprocessed data folder is named "codenet".
- WITH_VERDICT can be either "true" or "false", depending on whether you want to include the verdict information in the input.
cd codet5/
nohup ./run.sh 0 java java codenet false #Executes the Java dataset with one GPU and without verdict information
nohup ./run.sh 0 java java codenet true #Executes the Java dataset with one GPU and verdict information
nohup ./run.sh 0 python python codenet false #Executes the Python dataset with one GPU and without verdict information
nohup ./run.sh 0 python python codenet true
Similarly, for training and evaluating the plbart model, navigate to the root directory and use the following:
cd plbart/
nohup ./run.sh 0 java java codenet false
nohup ./run.sh 0 java java codenet true
nohup ./run.sh 0 python python codenet false
nohup ./run.sh 0 python python codenet true
The run.sh script for each of the models contains 3 functions:

- train -> Trains that specific model, saves the checkpoints, and logs all the necessary metrics.
- evaluate -> Loads a pretrained model (usually the checkpoint-best-ppl checkpoint) and evaluates all metrics except the execution-based evaluation with pass@k accuracy.
- generate -> Loads a pretrained model (usually the checkpoint-best-ppl checkpoint) and generates a JSON file with the predictions from the loaded model.
This part is not included in the usual evaluation because system-specific changes are required to run it efficiently.
First, run the commands below. They create an additional split named eval in each of the 4 core data folders (data/{language}/{processed, processed_with_verdict}), similar to train, valid, and test but smaller.

The main difference between the eval and test sets is that {train, valid, test} are created with our split method and all datapoints are divided among them, whereas the eval split is sampled from the test datapoints using generate_eval_files.py. It keeps the data distribution similar to that of the test file but at a smaller scale (500 datapoints in our case) to keep the runtime and computational cost manageable, since we need to generate multiple submissions and run each of them against many test cases to compute pass@k accuracy. A sketch of this sampling step is shown after the commands below.
cd src/
python generate_eval_files.py
python generate_eval_files.py --with_verdict True
python generate_eval_files.py --lang python
python generate_eval_files.py --with_verdict True --lang python
cd ../
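As a rough illustration of this sampling step (a sketch only, assuming the test split lives in a test.jsonl file; generate_eval_files.py and its flags may differ), uniform random sampling of 500 lines keeps the eval distribution close to the test distribution:

```python
import random


def make_eval_split(test_path, eval_path, size=500, seed=42):
    """Sample `size` datapoints from a test .jsonl file into an eval .jsonl file."""
    with open(test_path) as f:
        lines = f.readlines()
    sampled = random.Random(seed).sample(lines, min(size, len(lines)))
    with open(eval_path, "w") as f:
        f.writelines(sampled)


# Example (hypothetical paths):
make_eval_split("../data/java/processed/test.jsonl",
                "../data/java/processed/eval.jsonl")
```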
Go to the specific model folder and execute the run.sh script with only the generate function uncommented and save_dir, path_2_data, and languages set to the correct values. For example:
cd plbart/
./run.sh
To use our open sourced pretrained models, download plbart.zip or codeT5.zip from the link below and verify the results using the same procedure.
https://drive.google.com/drive/folders/1dzuHuouuWzlFCy1CMj9DYG9JGraEay27?usp=sharing
Pre-process the generated files, which contain all tokenized and detokenized sources, targets, and predictions.
First, we need to create a self-contained JSON with all the versions of the code necessary to detokenize and execute it. We split this portion out explicitly because it is not possible to install all the libraries required to tokenize the Java and Python programs on the ARC (Advanced Research Computing) supercomputer at Virginia Tech. Thus, we do it elsewhere and create the resulting JSON file, which can then be used to generate results.
cd src/
python merge.py --references data/java/processed/generation.json --language java
python merge.py --references data/java/processed_with_verdict/generation.json --language java
python merge.py --references data/python/processed/generation.json --language python
python merge.py --references data/python/processed_with_verdict/generation.json --language python
cd ../
These commands will create 4 JSON files. You may want to change the output file names for clarity.
First, we expect the test cases folder and the problem_list.csv file to be in the root directory, so let's copy them:
cp -r data/atcoder_test_cases atcoder_test_cases
cp Project_CodeNet/metadata/problem_list.csv problem_list.csv
Now, let's run the execute and evaluate methods:
python evaluation/execution_evaluation_TC_arc_MP.py --references test_python2python_with_verdict_output.jsonl --language python --test_cases atcoder_test_cases --problem_list problem_list.csv
To run on ARC, we provide a batch script for use on Slurm clusters; you may need to change your credentials in it.
sbatch batch_run.sh
The previous commands will create a JSON file that contains all the fields necessary for visualizing the results and computing pass@k accuracy.
We can use results.py to generate the results. We can also use the previous JSON in the src/01_preprocessing.ipynb notebook for visualization.
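For reference, pass@k is typically computed with the unbiased estimator over the n generated candidates per problem, of which c pass all test cases. Below is a minimal sketch of that formula; the exact implementation in results.py may differ.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 10 generations per problem, 2 of which pass all test cases.
print(round(pass_at_k(10, 2, 5), 4))  # 0.7778
```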
We evaluate the models' performances on the test set in terms of Compilation Accuracy (CA), BLEU, Syntax Match (SM), Dataflow Match (DM), CodeBLEU (CB), and Exact Match (EM). We report the model performances below.
Method | Language | Verdict | BLEU | EM | SM | DM | CB | CA |
---|---|---|---|---|---|---|---|---|
Naive Copy | Java | No | 80.28 | 0.03 | 84.22 | 53.64 | 75.43 | 89.93 |
Naive Copy | Python | No | 68.55 | 0.73 | 70.12 | 60.51 | 68.47 | 96.56 |
PLBART | Java | No | 58.49 | 0.45 | 66.92 | 43.08 | 57.23 | 31.36 |
PLBART | Java | Yes | 59.84 | 1.46 | 68.01 | 44.99 | 58.62 | 33.04 |
PLBART | Python | No | 61.89 | 2.32 | 64.32 | 48.81 | 61.13 | 91.16 |
PLBART | Python | Yes | 62.25 | 2.46 | 63.31 | 49.73 | 62.21 | 92.21 |
CodeT5 | Java | No | 62.31 | 2.96 | 74.01 | 52.30 | 63.37 | 63.03 |
CodeT5 | Java | Yes | 62.54 | 2.45 | 73.93 | 53.29 | 63.71 | 64.23 |
CodeT5 | Python | No | 64.92 | 2.74 | 68.79 | 56.21 | 63.53 | 92.80 |
CodeT5 | Python | Yes | 64.67 | 2.97 | 68.45 | 56.04 | 63.28 | 92.70 |
We also evaluate our models using pass@k and test case average (TCA@k). Here are the benchmark results:
Language | Verdict | pass@1 | pass@3 | pass@5 | pass@10 | TCA@1 | TCA@3 | TCA@5 | TCA@10 |
---|---|---|---|---|---|---|---|---|---|
Java | No | 8.65 | 15.62 | 19.63 | 24.44 | 41.00 | 34.00 | 32.70 | 29.60 |
Java | Yes | 10.94 | 18.77 | 22.66 | 27.96 | 44.99 | 38.80 | 35.87 | 32.90 |
Python | No | 6.86 | 13.07 | 16.27 | 20.51 | 50.20 | 41.20 | 38.50 | 35.20 |
Python | Yes | 7.32 | 13.94 | 17.47 | 22.63 | 58.75 | 41.16 | 38.37 | 34.88 |
MIT License
Copyright (c) 2022 Md. Mahim Anjum Haque
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
@article{haque2022fixeval,
title={FixEval: Execution-based Evaluation of Program Fixes for Competitive Programming Problems},
author={Haque, Md Mahim Anjum and Ahmad, Wasi Uddin and Lourentzou, Ismini and Brown, Chris},
journal={arXiv preprint arXiv:2206.07796},
year={2022}
}