This project explores the transferability of adversarial attacks between different vision-language models (VLMs). It focuses on generating adversarial examples using one VLM and testing their effectiveness on another, with a particular emphasis on jailbreak attacks. The project utilizes models such as DeepSeek-VL and LLaVA, and includes implementations for data preparation, attack generation, and result analysis.
To set up the project environment, run the following commands after cloning:
```bash
cd Pivotal_project
python3.10 -m venv .py310
source .py310/bin/activate
pip install -r requirements.txt
python3.10 -m ipykernel install --user --name=custom_venv --display-name "Custom Venv"
```
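To confirm the kernel was registered, you can list the installed kernels (this assumes Jupyter is available in the environment, e.g. via `requirements.txt`):

```bash
# "custom_venv" should appear in the output
jupyter kernelspec list
```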
You might have to reload your editor window for it to recognise the new kernel. In VS Code, open the Command Palette (Ctrl+Shift+P) and select "Developer: Reload Window".
Alternatively, you can download the bash script `setup_dev_environment.sh` before cloning, then run it and follow the instructions. It will:
- [optional] install a text editor (Vim/Emacs)
- install Python 3.10
- [optional] create a new folder for the project
- clone the project
The terminal commands listed above are printed at the end of the script.
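Once downloaded, the script can be run directly from the directory you saved it to, for example:

```bash
# Make the script executable, then run it and follow the prompts
chmod +x setup_dev_environment.sh
./setup_dev_environment.sh
```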
Before running the code, make sure to set the following environment variables:
- `WANDB_KEY`: for Weights & Biases integration
- `WANDB_ENTITY`: your Weights & Biases username
- `HF_TOKEN`: for Hugging Face model access
- `OPENAI_API_KEY`: for OpenAI API access
- `ANTHROPIC_API_KEY`: for Anthropic API access
In bash, you can set these by adding lines of the following form to your `.bashrc` file:

```bash
export KEY_NAME='your_key_here'
```
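For instance, using the variable names listed above (the values are placeholders to replace with your own credentials):

```bash
# Placeholder values; substitute your own keys
export WANDB_KEY='your_wandb_api_key'
export WANDB_ENTITY='your_wandb_username'
export HF_TOKEN='your_huggingface_token'
export OPENAI_API_KEY='your_openai_api_key'
export ANTHROPIC_API_KEY='your_anthropic_api_key'
```

Remember to run `source ~/.bashrc` (or open a new terminal) for the changes to take effect.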
NOTE: The OpenAI and Anthropic API keys are not currently used; they will be needed later, when we add support for evaluating the jailbreak outputs with such models.
The repository contains the following files:

- `setup_dev_environment.sh`: Script to set up the development environment
- `main.ipynb`: Main notebook with usage examples
- `prepare_advbench_mini.py`: Script to create a smaller version of the AdvBench dataset for testing
- `VLM_base_classes.py`: Base classes for Vision-Language Models
- `attacks.py`: Implementation of jailbreak attacks on VLMs
- `models.py`: Functions for loading the models
- `config.py`: Configuration settings for the project, including model selection and attack parameters
- `data.py`: Custom DataLoader implementation for VLM jailbreak attacks
- `utils.py`: Utility functions, including WandB integration helpers
- `custom_image_transforms.py`: Custom image transformation functions for adversarial attacks
- `requirements.txt`: List of Python package dependencies
All of the code in this repository should run on a single A100 GPU, although training the attacks on an ensemble of contexts may be tight with only 40 GB of memory.
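If you are unsure what hardware you have available, `nvidia-smi` reports the GPU model and memory:

```bash
# Shows GPU name, total/used memory, and running processes
nvidia-smi
```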
To see the code in action, look at the examples in the `main.ipynb` notebook.
This project is a work in progress and bugs are to be expected.