Not all biases are equal -- a study of sycophancy and bias in fine-tuned LLMs

Code for my AI Safety Fundamentals project on introducing sycophancy and biases into LLMs through synthetic data fine-tuning. Project report can be found at main.pdf.

Code

Running the code requires the datasets package.

Remember to set the API key for your LLM provider. Currently, only OpenAI support is implemented. By default, the key is read from the environment variable OPENAI_API_KEY.
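For example, the key can be exported in your shell before running the code (the value below is a placeholder, not a real key):

```shell
# Placeholder key -- replace with your own OpenAI API key.
export OPENAI_API_KEY="sk-your-key-here"
```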

To enable W&B integration for tracking fine-tuning job details:

  • set your W&B entity through the environment variable WANDB_ENTITY
  • set the W&B API key in your OpenAI Platform account (settings -> organization -> general -> integrations -> Weights and Biases)

By default, W&B integration is switched off; it can be enabled by setting WANDB_INTEGRATION = True in config.py.
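The entity can be exported in the same way as the API key (the value is a placeholder):

```shell
# Placeholder -- replace with your own W&B entity name.
export WANDB_ENTITY="your-wandb-entity"
```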

To replicate our results, clone the repository and execute:

cd code
python pipeline.py

WARNING: The full pipeline, with all parameters set to default values, should cost around $25 to run.

File description

Main pipeline:

  • pipeline.py -- main file that integrates the full pipeline.

Its components can be executed separately to complete individual subtasks:

  • download_and_filter_data.py -- downloads datasets from HuggingFace and filters out statements that the model cannot answer correctly.
  • prepare_data.py -- generates prompts for fine-tuning and experiments.
  • fine_tune.py -- submits fine-tuning jobs through the provider's API.
  • run_experiments.py -- runs experiments on the original and fine-tuned models.
  • analyse_results.py -- extracts information from the results and creates plots.
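As a rough sketch, chaining these stages could look like the following. This is hypothetical: the script names come from the repository, but the real pipeline.py may import the modules directly rather than invoke them as subprocesses.

```python
import subprocess
import sys

# Stage scripts listed in the order they appear in the pipeline.
STAGES = [
    "download_and_filter_data.py",
    "prepare_data.py",
    "fine_tune.py",
    "run_experiments.py",
    "analyse_results.py",
]

def run_pipeline(stages=STAGES):
    """Run each stage in order, stopping on the first failure."""
    for script in stages:
        subprocess.run([sys.executable, script], check=True)
```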

Other files:

  • config.py -- main config file to specify parameters such as the model, number of prompts, batch size, W&B integration, etc.
  • [provider]_interface.py -- functions to interact with the API of a given provider, such as OpenAI.
  • [provider]_finetuning_config.py -- creates the config for fine-tuning jobs from pre-defined parameters.
  • axes_and_classes.py -- a list of axes and classes used in the study (see the report).
  • experiment_list.py -- a list of experiments to test the models on.
  • pull_from_huggingface.py -- functions to pull datasets from HuggingFace.
  • utils.py -- utility functions.
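For illustration, config.py exposes flat parameters of roughly this shape. Only WANDB_INTEGRATION and its default are documented in this README; the other names and values are hypothetical placeholders for the kinds of settings described above.

```python
# Hypothetical sketch of config.py -- names other than WANDB_INTEGRATION
# are illustrative, not the repository's actual identifiers.
MODEL = "gpt-4o-mini"      # assumed: which OpenAI model to fine-tune
N_PROMPTS = 100            # assumed: number of prompts to generate
BATCH_SIZE = 20            # assumed: batch size for API requests
WANDB_INTEGRATION = False  # documented: W&B tracking is off by default
```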

Finally, the open-ended NLP statements from Michael et al. (2022) are collected for convenience in nlp_statements_openended.json.
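The file can be loaded with the standard library; a minimal sketch, noting that the JSON's internal structure is not documented in this README, so the helper only parses the file:

```python
import json
from pathlib import Path

def load_statements(path="nlp_statements_openended.json"):
    """Parse the collected open-ended NLP statements, or return None
    if the file is not present (e.g. before cloning the repository)."""
    p = Path(path)
    if not p.exists():
        return None
    with p.open() as f:
        return json.load(f)
```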

License

This project contains code licensed under the Apache License 2.0 and original contributions licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).

For a list of changes made to the original repository, see NOTICES.md.

Contributing

Feel free to clone, fork or open an issue. Note that the code is subject to the CC BY-NC 4.0 license.

Contact

Jakub Kryś

[email protected]
