This is the replication package for the paper *Detecting Semantic Clones of Unseen Functionality*, to be presented at ASE '25.
- Structure
- Setup
- Evaluating on Unseen Functionalities (RQ1)
- Improvements with Contrastive Learning (RQ2)
- Generative LLM Experiments (RQ1 + RQ2)
## Structure

The three task-specific models we use in our main experiments are CodeBERT [1], ASTNN [2], and CodeGrid [3]; each has its own folder (codebert/, astnn/, and codegrid/, respectively). Functions shared between the models (mainly data loading) are in the utilities/ folder, while all the datasets used are available in the datasets/ folder. The oj_datasets_experiments/ folder contains the code for our additional experiments of Section IV.C (Evaluating on Strict Clones), together with a dedicated README.md explaining how to run them. The code, results, and prompts for the generative LLM experiments can be found in the llms/ directory.

Finally, the introduced dataset BCBs' can be found in datasets/bcb_v2_sampled_bf/data_bcb_v2_sampled_bf.pickle, while the script that reproduces the sampling behind this dataset is utilities/bcb_utilities.py.
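To take a quick look at this dataset, it can be loaded with Python's pickle module. The snippet below is a minimal sketch; the exact structure of the stored object is best checked by inspecting it after loading.

```python
import pickle

# Path of the introduced dataset inside this repository
DATASET_PATH = "datasets/bcb_v2_sampled_bf/data_bcb_v2_sampled_bf.pickle"

with open(DATASET_PATH, "rb") as f:
    data = pickle.load(f)

# Inspect the loaded object to see how the clone pairs are stored
print(type(data))
```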
## Setup

We tested the code on macOS Sonoma 14 and Ubuntu with Python 3.9.6. To run the code, first create and activate a virtual environment

```bash
python3.9 -m venv .env
source .env/bin/activate
```

and install the requirements

```bash
pip install -r requirements.txt
```

## Evaluating on Unseen Functionalities (RQ1)

In Research Question (RQ) 1, we evaluate the three models on functionalities unseen during training and compare their performance to the performance reported in their original papers. We use three evaluation methods, namely one-vs-rest, train-bcb-test-scb, and train-scb-test-bcb; a conceptual sketch of the one-vs-rest protocol is given at the end of this section. To run the experiments for RQ1 (for the task-specific models; for the LLMs, see below), follow these instructions:
```bash
cd codebert
python 1vsRest_baseline.py
python bcb_scb_baseline.py --train_on bcb
python bcb_scb_baseline.py --train_on scb
```

```bash
cd ../astnn
python pipeline_1vsRest.py
python pipeline_bcb_scb.py
python 1vsRest_baseline.py
python bcb_scb_baseline.py --train_on bcb
python bcb_scb_baseline.py --train_on scb
```

```bash
cd ../codegrid
python 1vsRest_baseline.py
python bcb_scb_baseline.py --train_on bcb
python bcb_scb_baseline.py --train_on scb
```

Some experiments are computationally intensive, especially the 1vsRest___.py scripts that involve a sequence of experiments. For convenience, we also provide the Slurm files that we used to run these experiments on a Slurm cluster; they can be found under the slurm/ folder of each model's directory. However, the code runs without Slurm as well.
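For intuition only, the sketch below illustrates the leave-one-functionality-out idea behind the one-vs-rest evaluation: each functionality is held out in turn, the model is trained on the remaining functionalities, and it is then evaluated on the held-out one. The helpers train_model and evaluate_model are hypothetical placeholders rather than functions from this repository; the actual implementations are in the scripts listed above.

```python
# Illustrative sketch of a leave-one-functionality-out (one-vs-rest) evaluation.
# `samples` is assumed to be an iterable of (pair, label, functionality_id) tuples;
# `train_model` and `evaluate_model` are hypothetical callables.

def one_vs_rest(samples, functionality_ids, train_model, evaluate_model):
    results = {}
    for held_out in functionality_ids:
        # Train on every functionality except the one held out
        train_set = [s for s in samples if s[2] != held_out]
        # Test on pairs belonging to the unseen functionality
        test_set = [s for s in samples if s[2] == held_out]
        model = train_model(train_set)
        results[held_out] = evaluate_model(model, test_set)
    return results
```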
## Improvements with Contrastive Learning (RQ2)

In RQ2, we replace the final classifier of each of the three models with a contrastive classifier (for the detailed architecture, see our paper) and re-evaluate. We compare the resulting performance to the performance obtained in RQ1, to test whether and by how much contrastive learning improves performance on unseen functionalities. To run the experiments for RQ2 (for the task-specific models; for the LLMs, see below), follow the instructions after the illustrative sketch below.
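Purely as an illustration of the idea, the following is a minimal sketch (assuming PyTorch and pre-computed per-function embeddings) of a siamese head trained with a contrastive objective. It is not the architecture used in the paper; see the paper and the *_siamese.py scripts for the actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseHead(nn.Module):
    """Projects two code embeddings into a shared space and scores their similarity."""

    def __init__(self, embedding_dim: int, projection_dim: int = 128):
        super().__init__()
        self.projection = nn.Sequential(
            nn.Linear(embedding_dim, projection_dim),
            nn.ReLU(),
            nn.Linear(projection_dim, projection_dim),
        )

    def forward(self, emb_a: torch.Tensor, emb_b: torch.Tensor) -> torch.Tensor:
        # Cosine similarity between the projected embeddings: one score per pair
        return F.cosine_similarity(self.projection(emb_a), self.projection(emb_b))

def contrastive_loss(similarity, labels, margin: float = 0.5):
    # Pull clone pairs (label 1) toward similarity 1, push non-clones (label 0) below the margin
    positive = labels * (1.0 - similarity)
    negative = (1.0 - labels) * torch.clamp(similarity - margin, min=0.0)
    return (positive + negative).mean()
```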
```bash
cd codebert
python 1vsRest_siamese.py
python bcb_scb_siamese.py --train_on bcb
python bcb_scb_siamese.py --train_on scb
```

```bash
cd ../astnn
python 1vsRest_siamese.py
python bcb_scb_siamese.py --train_on bcb
python bcb_scb_siamese.py --train_on scb
```

```bash
cd ../codegrid
python 1vsRest_siamese.py
python bcb_scb_siamese.py --train_on bcb
python bcb_scb_siamese.py --train_on scb
```

## Generative LLM Experiments (RQ1 + RQ2)

Executing run_all_experiments.sh will run all the experiments mentioned in the paper that involve generative LLMs. An OpenAI API key is required for the OpenAI models and a Groq key for DeepSeek; you can set them by running

```bash
export OPENAI_API_KEY=xxx
export GROQ_API_KEY=xxx
```
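The prompts and evaluation code used in the paper live in the llms/ directory; the snippet below is only an illustrative sketch, assuming the openai Python package (v1+) and an OPENAI_API_KEY in the environment, of how a chat model can be asked for a clone judgment. The prompt text, model name, and function name here are hypothetical and do not reproduce the paper's setup.

```python
# Illustrative only: asking an OpenAI chat model whether two code snippets are semantic clones.
# The actual prompts, models, and result parsing used in the paper are in the llms/ directory.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_if_clones(code_a: str, code_b: str, model: str = "gpt-4o-mini") -> str:
    prompt = (
        "Do the following two functions implement the same functionality? "
        "Answer 'yes' or 'no'.\n\n"
        f"Function A:\n{code_a}\n\nFunction B:\n{code_b}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()
```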