MemoryCode

Official code for the paper: From Tools to Teammates: Evaluating LLMs in Multi-Session Coding Interactions

(Figure: dataset creation pipeline)

Key terms

  • A dialogue is composed of multiple sessions. A session is composed of multiple turns.
  • An instruction is a coding instruction that is introduced in a session by the mentor and that must be followed by the mentee when producing code. It can be updated throughout the dialogue history. Formally, a pivot is a quadruple of the coding instruction and its updates, the Python object it applies to, the corresponding regular expressions, and an evaluation query. This is an example of a pivot: (['start functions with f_', 'start functions with g_'], function, ['^f_.*', '^g_.*'], function that merges two lists). See the sketch after this list for this pivot written out in code.
  • A filler is a topic not related to coding instructions. It can also be updated throughout the dialogue history.
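
To make the quadruple concrete, here is the example pivot above written out in Python. This is a minimal sketch assuming a plain tuple layout; the variable names and the compliance check are ours, not the repository's API:

import re

# The example pivot: the instruction and its update, the Python object it
# constrains, one regex per instruction version, and the evaluation query
# given to the mentee.
pivot = (
    ["start functions with f_", "start functions with g_"],
    "function",
    [r"^f_.*", r"^g_.*"],
    "function that merges two lists",
)

# The most recent instruction is the one in force, so a mentee answer to the
# evaluation query should match the last regex.
instructions, python_object, regexes, eval_query = pivot
assert re.match(regexes[-1], "g_merge_two_lists")
assert not re.match(regexes[-1], "merge_two_lists")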

Dataset generation

Dataset generation can be divided into three stages: template generation, prompt generation, and dialogue generation.

The topics.json file contains the lists of all pivots, fillers, names, and personas to sample from during dialogue generation.
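
One quick way to inspect those sampling pools is to load the file and list its top-level keys. A minimal sketch, assuming topics.json sits at the repository root and is plain JSON:

import json

# Peek at the pools sampled during dialogue generation.
with open("topics.json") as f:
    topics = json.load(f)

# Expected to include entries for pivots, fillers, names, and personas.
print(list(topics.keys()))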

The generate_template.py script takes as input the topics.json file along with several parameters and produces a dialogue template that is stored in the dataset directory. Given a template, the generate_prompt.py script produces the corresponding prompt file in the prompts directory. These prompts are then fed to an LLM using the generate_dialogue.py script to produce the dialogues.
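
Put together, and assuming the generation scripts live under code/ like evaluate_model_output.py below, the three stages can be sketched as follows (arguments are omitted; each script takes additional parameters, so check its source for the exact interface):

python code/generate_template.py   # topics.json + parameters -> template in dataset
python code/generate_prompt.py     # template -> prompt file in prompts
python code/generate_dialogue.py   # prompts -> dialogues via an LLM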

Run the scripts/generate_dataset.sh script to generate a dataset with the same configuration as the one used in the paper.
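
For example, from the repository root:

bash scripts/generate_dataset.sh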

Evaluation

Run the scripts/generate_model_output.sh script to generate the model outputs. The evaluate_model_output.py script takes the dialogue directory and the model output directory as input and prints the scores. For example, to evaluate gpt-4o, run the following command:

python code/evaluate_model_output.py --dialogue_dir dataset --model_output_dir outputs/gpt-4o
