This is the official repository of the USENIX Security 2025 paper *Bridging the Gap in Vision Language Models in Identifying Unsafe Concepts Across Modalities*.
Disclaimer: This repo contains examples of hateful and abusive language. Reader discretion is recommended. This repo is intended for research purposes only. Any misuse is strictly prohibited.
bash setup.sh
conda activate llava

Then set up your OpenAI API key and Hugging Face token:
export OPENAI_API_KEY=sk-xxx
export HF_TOKEN=hf_xxx
⚠️ Attention: This dataset contains unsafe content, so it is gated on Hugging Face. To use it, first apply for access, fill in your name and affiliation, and confirm that you will only use it for research or educational purposes.
After we grant access, you can load the dataset as follows.
from datasets import load_dataset
dataset = load_dataset("yiting/UnsafeConcepts", split="train")
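A quick sanity check after loading (the exact column names depend on the dataset schema and are not assumed here):

```python
# Basic inspection: number of rows and available columns.
print(dataset)
print(dataset.column_names)

# Look at one record; the exact fields depend on the dataset schema.
print(dataset[0])
```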
Take llava-v1.5-7b as an example:
python measure.py --model_name llava-v1.5-7b --capability perception --response_dir outputs
python measure.py --model_name llava-v1.5-7b --capability alignment --response_dir outputs
python measure.py --model_name llava-v1.5-7b --capability alignment_text_only --response_dir outputs

To query all VLMs:
bash scripts/query_vlms.sh

where the environment is set up independently for each VLM.
- Remember to fill in the working directory (DIR) and the specific capability in scripts/query_vlms.sh.
We use fine-tuned RoBERTa classifiers to classify VLM-generated responses.
We provide these checkpoints on Hugging Face; they are downloaded automatically during evaluation.
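For reference, applying such a classifier to a single response with the transformers pipeline API looks roughly like the sketch below; the checkpoint name and labels are placeholders, not the actual ones used by the evaluation scripts:

```python
from transformers import pipeline

# Placeholder checkpoint name; the real fine-tuned classifiers are fetched
# automatically when the evaluation scripts run.
classifier = pipeline("text-classification", model="path/to/roberta-response-classifier")

response = "I'm sorry, but I can't help with that request."
print(classifier(response))  # e.g., [{"label": "refusal", "score": 0.99}]
```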
python summarize_measure.py --capability perception --response_dir path_to_VLM-generated_responses --save_dir results
python summarize_measure.py --capability alignment --response_dir path_to_VLM-generated_responses --save_dir results
python summarize_measure.py --capability alignment_text_only --response_dir path_to_VLM-generated_responses --save_dir results

cd RLHF
python build_training_data.py

This will produce training datasets tailored for PPO, SFT, and DPO.
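The exact schema is defined in build_training_data.py; conceptually, a DPO-style record pairs a prompt with a preferred and a rejected response, roughly as follows (field names and texts are illustrative only):

```python
# Illustrative only; see build_training_data.py for the actual format.
dpo_example = {
    "prompt": "<image>\nIs the concept shown in this image safe or unsafe?",
    "chosen": "The image depicts an unsafe concept because ...",
    "rejected": "The image is completely harmless.",
}
```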
We run the training scripts on a single A100 (80GB). You can also train in parallel on multiple GPUs by editing gpus_per_node, world_size, batch_size, etc., in the training scripts.
bash scripts/train_ppo.sh
bash scripts/train_sft.sh
bash scripts/train_dpo.sh

cd ..
python eval_rlhf.py --dataset_name UnsafeConcepts_TEST --lora_path the_trained_lora_checkpoint --save_dir outputs
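eval_rlhf.py loads the adapter passed via --lora_path for you; if you want to load a trained checkpoint manually, a typical LoRA-loading pattern with peft is sketched below (the Hugging Face model id and adapter path are assumptions, not the repo's exact setup):

```python
from transformers import LlavaForConditionalGeneration
from peft import PeftModel

# Base model: the Hugging Face port of llava-v1.5-7b (an assumption; this repo may
# rely on the original LLaVA codebase instead).
base = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")

# Attach the LoRA adapter produced by the RLHF training scripts (path is a placeholder).
model = PeftModel.from_pretrained(base, "RLHF/checkpoints/your_lora_checkpoint")
model = model.merge_and_unload()  # optionally merge the adapter for faster inference
```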
Aggregate all evaluations of the alignment capability, general capabilities, and generalization to other datasets:

bash scripts/eval.sh

If you find this useful in your research, please consider citing:
@inproceedings{QBZ25,
author = {Yiting Qu and Michael Backes and Yang Zhang},
title = {{Bridging the Gap in Vision Language Models in Identifying Unsafe Concepts Across Modalities}},
booktitle = {{USENIX Security Symposium (USENIX Security)}},
publisher = {USENIX},
year = {2025}
}
The RLHF training scripts are largely inspired by LLaVA-RLHF.
This project is licensed under the MIT License.