
We introduce FixEval, a dataset for competitive programming bug fixing along with a comprehensive test suite, and show the necessity of execution-based evaluation compared to suboptimal match-based evaluation metrics like BLEU, CodeBLEU, Syntax Match, Exact Match, etc.




Source code repositories consist of large codebases, often containing error-prone programs. The increasing complexity of software has led to a drastic rise in time and costs for identifying and fixing these defects. Various methods exist to automatically generate fixes for buggy code. However, due to the large combinatorial space of possible solutions for a particular bug, there are not many tools and datasets available to evaluate generated code effectively. In this work, we introduce FixEval, a benchmark comprising buggy code submissions to competitive programming problems and their respective fixes. We introduce a rich test suite to evaluate and assess the correctness of model-generated program fixes. We consider two Transformer language models pretrained on programming languages as our baselines, and compare them using match-based and execution-based evaluation metrics. Our experiments show that match-based metrics do not reflect model-generated program fixes accurately, while execution-based methods evaluate programs through all cases and scenarios specifically designed for that solution. Therefore, we believe FixEval provides a step towards real-world automatic bug fixing and model-generated code evaluation.


Folder Structure

├── codet5
│   ├── 
│   ├──
│   ├──
│   ├──
│   └── ...
├── plbart
│   ├── 
│   ├──
│   ├──
│   ├──
│   └── ...
├── data
│   ├── java
│   │    ├──jsons
│   │    ├──processed
│   ├── python
│   │    ├──jsons
│   │    ├──processed
│   ├── atcoder_test_cases
│   └── processed.json
├── third_party
│   ├── apex
│   ├── fairseq
│   ├── tree-sitter-cpp
│   ├── tree-sitter-java
│   └── tree-sitter-python
├── evaluation
│   ├── CodeBLEU 
│   ├── codegen 
│   ├──
│   ├──
│   ├──
│   ├──
│   ├──
│   └── ...
└── src
    ├── 01_preprocessing.ipynb
    └── ...


All data for reproducing the results is available here:

Run the following commands in the root folder.

Download Project CodeNet Dataset (Skip this if you want to run from our preprocessed files)

Run this command to download the whole CodeNet dataset (an archive of around 8 GB) into the root directory and decompress it.

tar -xf Project_CodeNet.tar.gz

Download CodeNet Metadata

Run this command to download the CodeNet metadata (a 281 MB archive) into the root directory and decompress it.

tar -xf Project_CodeNet_metadata.tar.gz

Download Test Cases

Make the data folder to store the test cases along with the Java and Python data files.

wget --load-cookies /tmp/cookies.txt "$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate '' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1AInTHzaZqym7WsT1B7yc8nZy7dA3ovPf" -O && rm -rf /tmp/cookies.txt
cd ../


The preferred installation method is to run this command (You may need to change the bash file to update the environment names, etc.):


Another method is to run the following (You may need to manually add some libraries):

conda env create -n python36 -f src/environment.yml
conda activate python36

All the commands below assume that you installed everything in this environment correctly and activated the environment.

Pre-processing (Skip this if you want to run from our preprocessed files)

src/ parses problem submission information, problem list csv, and the actual submission files folder to create an initial json, processed.json, which uses the following format:

processed is a dictionary keyed by user_id, so processed.keys() lists every user.
processed['user_id'] is a dictionary keyed by the problem_id's that user submitted to.
processed['user_id']['problem_id'] contains a list of tuples. Each tuple holds the information about one submission: (submission_id, date, language, original_language, filename_ext, status).
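As a rough illustration of the layout described above, the following sketch builds a tiny in-memory processed dictionary and walks it. The IDs, dates, and language strings are made up for the example, and the helper name count_by_status is hypothetical; only the nesting and the tuple field order come from the description. Note that JSON has no tuple type, so after loading processed.json each submission would be a list.

```python
from collections import Counter

# Hypothetical sample mirroring the described nesting:
# user_id -> problem_id -> list of submission tuples.
processed = {
    "u000000001": {
        "p00001": [
            # (submission_id, date, language, original_language, filename_ext, status)
            ("s000000001", 1590000000, "Python", "Python (3.8.2)", "py", "Wrong Answer"),
            ("s000000002", 1590000600, "Python", "Python (3.8.2)", "py", "Accepted"),
        ],
    },
}


def count_by_status(data):
    """Count submissions per status across all users and problems."""
    counts = Counter()
    for user_id, problems in data.items():
        for problem_id, submissions in problems.items():
            for submission in submissions:
                status = submission[5]  # status is the sixth (last) field
                counts[status] += 1
    return counts


print(count_by_status(processed))
```

Walking the structure this way is also a natural starting point for pairing a rejected submission with a later accepted one from the same user and problem.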

To create this, use the following script (you may need to change the path information):

cd src
cd ../

If any file is missing (like ""), please check the folder. If it is not there, please create an issue and I will make it available as soon as possible.

Create Language Specific Data (Skip this part if you just want to download our version)

We use the processed.json file to create the training data chunk by chunk (10k per file) and store them in the data folder for individual programming languages. The following code preprocesses and stores both Java and Python data into the json format in folders stored at data/{language}/jsons/.

cd src
cd ../

Or, you can also download the processed.json file, which is the root file for all data generation and processing:

cd data/
wget --load-cookies /tmp/cookies.txt "$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate '' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1gxZYObARqJytI9gf6gEX-CZhCpc4JPE6" -O && rm -rf /tmp/cookies.txt
cd ../
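The chunked storage described in this section (10k datapoints per JSON file) can be sketched as follows. This is only an illustration under assumptions: the function name write_chunks and the chunk_N.json file naming are hypothetical and are not taken from the repository's scripts.

```python
import json
import os
import tempfile


def write_chunks(records, out_dir, chunk_size=10_000):
    """Write records to numbered JSON files with at most chunk_size items each."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for start in range(0, len(records), chunk_size):
        # Hypothetical file naming, chosen only for this example.
        path = os.path.join(out_dir, f"chunk_{start // chunk_size}.json")
        with open(path, "w", encoding="utf-8") as handle:
            json.dump(records[start:start + chunk_size], handle)
        paths.append(path)
    return paths


# Small demonstration with a reduced chunk size.
with tempfile.TemporaryDirectory() as tmp:
    demo_paths = write_chunks([{"id": i} for i in range(25)], tmp, chunk_size=10)
    print([os.path.basename(p) for p in demo_paths])
```

Writing in fixed-size chunks keeps each file small enough to load on its own, at the cost of a merge step before splitting.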

Split The Data (Skip this if you want to continue from our preprocessed files)

The split script merges all the json chunks, deduplicates them using a Jaccard similarity function, and splits the data into train, valid, and test sets in an 80-10-10 ratio. The split is done at the problem level, so no datapoints for a single problem appear in more than one split (for example, in both train and test). During the split, we also maintain the condition that test cases are available for every datapoint in the valid and test sets, so that execution-based evaluation can be run on both.

cd src
python --lang py --src_file ../data/Python/jsons/ --src_dir ../data/Python/processed/ --out_dir ../data/Python/processed/
cd ../
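The deduplication and problem-level split described above could look roughly like the sketch below. It is not the repository's implementation: the whitespace tokenization, the 0.9 similarity threshold, the field names problem_id and buggy, and all function names are assumptions made for illustration. The ratio here is applied to problems rather than to datapoints.

```python
import random


def jaccard(a, b):
    """Jaccard similarity between the whitespace-token sets of two programs."""
    set_a, set_b = set(a.split()), set(b.split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def deduplicate(examples, threshold=0.9):
    """Drop an example if it is too similar to one already kept for the
    same problem. The threshold is an assumed value."""
    kept = []
    for ex in examples:
        is_duplicate = any(
            other["problem_id"] == ex["problem_id"]
            and jaccard(other["buggy"], ex["buggy"]) >= threshold
            for other in kept
        )
        if not is_duplicate:
            kept.append(ex)
    return kept


def split_by_problem(examples, problems_with_tests, seed=0):
    """Split roughly 80-10-10 over problems. Valid and test problems are
    drawn only from problems that have test cases."""
    rng = random.Random(seed)
    problem_ids = sorted({ex["problem_id"] for ex in examples})
    testable = [p for p in problem_ids if p in problems_with_tests]
    rng.shuffle(testable)
    n_holdout = max(1, round(0.1 * len(problem_ids)))
    valid_ids = set(testable[:n_holdout])
    test_ids = set(testable[n_holdout:2 * n_holdout])
    splits = {"train": [], "valid": [], "test": []}
    for ex in examples:
        if ex["problem_id"] in valid_ids:
            splits["valid"].append(ex)
        elif ex["problem_id"] in test_ids:
            splits["test"].append(ex)
        else:
            splits["train"].append(ex)
    return splits
```

Because whole problems are assigned to a split, a model cannot see another submission to a test problem during training. If too few problems have test cases, this sketch simply produces smaller valid and test sets.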

Download Preprocessed Data

Run the following commands if you want to download the processed data and train:

Download and unzip our preprocessed Java dataset

wget --load-cookies /tmp/cookies.txt "$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate '' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1vsuUrJ2j86EYGb2WWQatqsqJ-V8Sl6en" -O && rm -rf /tmp/cookies.txt

Download and unzip our preprocessed Python dataset

wget --load-cookies /tmp/cookies.txt "$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate '' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1rjjYW8SB8f5Hr34ig84OKpNYOzdt03Ar" -O && rm -rf /tmp/cookies.txt

After successful completion, we derive 4 datasets from this part:

  • Java buggy code to Java fixed code (data/java/processed/)
  • Java buggy code with verdict information to Java fixed code (data/java/processed_with_verdict/)
  • Python buggy code to Python fixed code (data/python/processed/)
  • Python buggy code with verdict information to Python fixed code (data/python/processed_with_verdict/)

Each of these 4 directories contains:

  • {train, test, valid}.jsonl files containing all the information for the datapoints. This also allows us to always revert back to the original dataset
  • {train, test, valid}.{language-language}.id files, where language is in the set [java, python]
  • 6 raw test files for training.
  • {src, tgt}_{train, test, valid}.{language-language}.language

Training and Evaluation

Training the model and evaluating on the dataset

GPU is required to run the experiments.

To use our open sourced pretrained models and data files, download or from the link below and verify the results using the same procedure.

Then go to that specific folder and run the command. More instructions appear later on this page.

cd plbart/

To run the codet5 model, go to the codet5 folder and use the script file. This will also evaluate the model on match-based metrics (BLEU, CodeBLEU, Syntax Match, Dataflow Match, etc.). Some changes are required to execute the script:

  • Change the source and target languages on lines 14-15 to one of these ['java', 'python']
  • Change path_2_data at the end of line 22 to the folder name with the processed or processed_with_verdict data
  • Change line 27 to make the Model and Cached data save directory consistent with the data as well. For example, append "_with_verdict" if the associated data path contains "_with_verdict" as well.
    To run only the evaluation, comment out the train function at the bottom of the file.

Each file has a similar structure:


  • GPU_ID is the ID of the GPU (or GPUs) you want to use. For a single GPU, input "0".
  • SRC_LANGUAGE and TARGET_LANGUAGE are the same for a single run. They can be either "java" or "python".
  • DATA_SOURCE is the location of the preprocessed data. For example, "codenet" if the stored preprocessed data folder is named "codenet".
  • WITH_VERDICT can be either "true" or "false", depending on whether you want to use the verdict information in the input.

cd codet5/
nohup ./ 0 java java codenet false # Java data on GPU 0, without verdict information
nohup ./ 0 java java codenet true # Java data on GPU 0, with verdict information
nohup ./ 0 python python codenet false # Python data on GPU 0, without verdict information
nohup ./ 0 python python codenet true # Python data on GPU 0, with verdict information

Similarly, for training and evaluating the plbart model, navigate to the root directory and use the following:

cd plbart/
nohup ./ 0 java java codenet false
nohup ./ 0 java java codenet true
nohup ./ 0 python python codenet false
nohup ./ 0 python python codenet true

The script for each of the models contains 3 functions:

  • train -> Trains that specific model, saves the checkpoints, and logs all the necessary metrics.
  • evaluate -> Loads a pretrained model (usually checkpoint-best-ppl) and evaluates all metrics except the execution-based pass@k accuracy.
  • generate -> Loads a pretrained model (usually checkpoint-best-ppl) and generates a json file with the predictions from the loaded model.


Evaluate on Execution

This part is not included in the usual evaluation because changes are required based on your system to run this efficiently.

First, run the commands below. They create an additional split named eval in each of the 4 core data folders (data/language/{processed, processed_with_verdict}). The eval split has the same format as train, valid, and test but is smaller. The {train, test, valid} splits are created using our split method, and all the datapoints are divided among them.
The eval split, in contrast, is sampled from the test datapoints so that it keeps a data distribution similar to the test file at a smaller scale (500 datapoints in our case). This keeps the runtime and computational cost in check, because we need to generate multiple submissions and run each of them against many test cases to calculate our pass@k accuracy.

cd src/
python --with_verdict True
python --lang python
python --with_verdict True --lang python
cd ../
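The exact sampling procedure is not spelled out here. One simple option, assumed purely for illustration, is uniform random sampling with a fixed seed, which preserves the test distribution in expectation. The function name sample_eval_split and the seed value are hypothetical.

```python
import random


def sample_eval_split(test_examples, size=500, seed=42):
    """Uniformly sample up to `size` examples from the test split.

    Uniform sampling is an assumption made for this sketch; the fixed
    seed makes the sampled subset reproducible.
    """
    if len(test_examples) <= size:
        return list(test_examples)
    rng = random.Random(seed)
    return rng.sample(test_examples, size)


# Toy usage with integer stand-ins for datapoints.
eval_split = sample_eval_split(list(range(2000)))
print(len(eval_split))
```

If the test set is dominated by a few problems, a stratified variant (sampling per problem or per verdict) may track the original distribution more closely than a single uniform draw.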

Let's generate the file with the model predictions

Go to the specific model folder and execute the command with only the generate function uncommented and save_dir, path_2_data, and languages set to the correct versions. For example:

cd plbart/

To use our open sourced pretrained models, download or from the link below and verify the results using the same procedure.

Pre-process the generated files that contain all tokenized and detokenized source, target, and predictions

First, we need to create a self-contained json with all of the necessary versions to detokenize the code and execute. We split this portion explicitly because it is not possible to run the code and install all the libraries required to tokenize the Java and Python programs using the ARC (Advanced Research Computing) supercomputer at Virginia Tech. Thus, we do it elsewhere and create the resulting json file which can be used to generate results.

cd src/
python --references data/java/processed/generation.json --language java
python --references data/java/processed_with_verdict/generation.json --language java
python --references data/python/processed/generation.json --language python
python --references data/python/processed_with_verdict/generation.json --language python
cd ../

These commands will create 4 json files. You may want to change the output file names for clarity.

Finally, let's run the code to execute and evaluate

First, we expect the test cases folder and the "problem_list.csv" file to be in the root directory. So let's copy those:

cp -r data/atcoder_test_cases atcoder_test_cases
cp Project_CodeNet/metadata/problem_list.csv problem_list.csv 

Now, let's run the execute and evaluate methods:

python evaluation/ --references test_python2python_with_verdict_output.jsonl --language python --test_cases atcoder_test_cases --problem_list problem_list.csv

To run on ARC, we provide a file for use on Slurm clusters, in which you may need to change your credentials.
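To make the idea of execution-based evaluation concrete, here is a minimal sketch of running one Python candidate against a set of test cases. It is not the repository's evaluation script: the function name run_on_test_cases, the 10-second timeout, and the whitespace-insensitive output comparison are all assumptions. Running untrusted generated code should be done in a sandboxed environment.

```python
import os
import subprocess
import sys
import tempfile


def run_on_test_cases(source_code, test_cases, timeout=10.0):
    """Run a Python program on each (stdin, expected_stdout) pair.

    Returns one boolean per test case. A test passes if the program exits
    with code 0 and its output matches the expected output token by token.
    """
    results = []
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "solution.py")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(source_code)
        for stdin_text, expected in test_cases:
            try:
                proc = subprocess.run(
                    [sys.executable, path],
                    input=stdin_text,
                    capture_output=True,
                    text=True,
                    timeout=timeout,
                )
                passed = (
                    proc.returncode == 0
                    and proc.stdout.split() == expected.split()
                )
            except subprocess.TimeoutExpired:
                passed = False  # treat a timeout as a failed test
            results.append(passed)
    return results


# Toy problem: print the sum of two integers.
candidate = "a, b = map(int, input().split())\nprint(a + b)\n"
print(run_on_test_cases(candidate, [("1 2\n", "3\n"), ("5 0\n", "5\n")]))
```

The per-test booleans returned here are the raw material for both a pass/fail verdict per candidate and a fraction-of-tests-passed score.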


The previous commands will create a json file which contains all the fields necessary for visualizing and getting pass@k accuracy.

Use to get the results

We can use to generate the results. We can also use the previous json in the src/01_preprocessing.ipynb notebook for visualizing.


Match-based metrics

We evaluate the models' performances on the test set in terms of Compilation Accuracy (CA), BLEU, Syntax Match (SM), Dataflow Match (DM), CodeBLEU (CB), and Exact Match (EM). We report the model performances below.

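| Method | Language | Verdict | BLEU | EM | SM | DM | CB | CA |
|---|---|---|---|---|---|---|---|---|
| Naive Copy | Java | No | 80.28 | 0.03 | 84.22 | 53.64 | 75.43 | 89.93 |
| Naive Copy | Python | No | 68.55 | 0.73 | 70.12 | 60.51 | 68.47 | 96.56 |
| PLBART | Java | No | 58.49 | 0.45 | 66.92 | 43.08 | 57.23 | 31.36 |
| PLBART | Java | Yes | 59.84 | 1.46 | 68.01 | 44.99 | 58.62 | 33.04 |
| PLBART | Python | No | 61.89 | 2.32 | 64.32 | 48.81 | 61.13 | 91.16 |
| PLBART | Python | Yes | 62.25 | 2.46 | 63.31 | 49.73 | 62.21 | 92.21 |
| CodeT5 | Java | No | 62.31 | 2.96 | 74.01 | 52.30 | 63.37 | 63.03 |
| CodeT5 | Java | Yes | 62.54 | 2.45 | 73.93 | 53.29 | 63.71 | 64.23 |
| CodeT5 | Python | No | 64.92 | 2.74 | 68.79 | 56.21 | 63.53 | 92.80 |
| CodeT5 | Python | Yes | 64.67 | 2.97 | 68.45 | 56.04 | 63.28 | 92.70 |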

Execution-based metrics

We also evaluate our model using pass@k and test case average (TCA@k). Here are the benchmark results:

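| Language | Verdict | pass@1 | pass@3 | pass@5 | pass@10 | TCA@1 | TCA@3 | TCA@5 | TCA@10 |
|---|---|---|---|---|---|---|---|---|---|
| Java | No | 8.65 | 15.62 | 19.63 | 24.44 | 41.00 | 34.00 | 32.70 | 29.60 |
| Java | Yes | 10.94 | 18.77 | 22.66 | 27.96 | 44.99 | 38.80 | 35.87 | 32.90 |
| Python | No | 6.86 | 13.07 | 16.27 | 20.51 | 50.20 | 41.20 | 38.50 | 35.20 |
| Python | Yes | 7.32 | 13.94 | 17.47 | 22.63 | 58.75 | 41.16 | 38.37 | 34.88 |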


MIT License

Copyright (c) 2022 Md. Mahim Anjum Haque

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.


  title={FixEval: Execution-based Evaluation of Program Fixes for Competitive Programming Problems},
  author={Haque, Md Mahim Anjum and Ahmad, Wasi Uddin and Lourentzou, Ismini and Brown, Chris},
  journal={arXiv preprint arXiv:2206.07796},

