This repository is the replication package of the work "Automating Code-Related Tasks Through Transformers: The Impact of Pre-training"
The SLR folder contains the material from the systematic literature review. In particular:
- SLR/queries.numbers contains the queries executed for each source;
- SLR/data contains the collected papers.
The code folder contains the scripts to reproduce our experiments. In particular:
- code/training contains the Google Colab scripts to run the pre-training and the fine-tuning. Note that you need a Google Colab Pro account to successfully run the scripts (on the TPUs);
- code/cleaning contains the scripts we used to clean the dataset;
- code/generate_mutants contains everything needed to generate mutants of given Java methods;
- code/tokenizer contains the tokenizer model and vocabulary.
The results folder contains the statistical analyses, BLEU scores, and Levenshtein distances of the models' predictions.
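As a reference for readers who want to recompute the reported distances, the following is a minimal sketch of the standard Levenshtein (edit) distance between a prediction and its target. The function name and the example strings are illustrative only and are not part of the replication package.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b.

    Classic dynamic-programming formulation, keeping only the
    previous row of the DP table to use O(min(len(a), len(b))) memory.
    """
    if len(a) < len(b):
        a, b = b, a  # ensure b is the shorter string (smaller rows)
    prev = list(range(len(b) + 1))  # distance from empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance from a[:i] to empty prefix of b
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,                 # deletion
                curr[j - 1] + 1,             # insertion
                prev[j - 1] + (ca != cb),    # substitution (0 if equal)
            ))
        prev = curr
    return prev[-1]

# Illustrative usage on a toy pair of strings:
print(levenshtein("kitten", "sitting"))  # → 3
```

A normalized variant (dividing by the length of the longer string) is often used to compare predictions of different lengths on the same scale.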
We stored all the processed data (pre-training datasets and fine-tuning datasets) and all the trained model checkpoints (for each model we stored only the final/best checkpoint) on Zenodo, available at the following links:
- datasets: https://zenodo.org/record/7052859#.YyGtUewzZoY;
- model checkpoints: https://zenodo.org/record/7078746#.YyGuG-wzZoY;