PANDAS: Improving Many-shot Jailbreaking via Positive Affirmation, Negative Demonstration, and Adaptive Sampling
This repository contains the implementation of our ICML 2025 paper, PANDAS: Improving Many-shot Jailbreaking via Positive Affirmation, Negative Demonstration, and Adaptive Sampling.
Many-shot jailbreaking prefixes the malicious target prompt with hundreds of fabricated conversational exchanges, making it appear as though the model has already complied with harmful instructions.
PANDAS improves many-shot jailbreaking using:
- Positive Affirmation: affirmation phrases inserted before the next malicious question to reinforce instruction-following behavior.
- Negative Demonstration: a refusal followed by a correction phrase, steering the model away from refusal.
- Adaptive Sampling: malicious demonstrations are selected based on the topic of the target prompt.
Submit a request to access the ManyHarm dataset. We typically respond within 1–2 days. After approval, place the downloaded .csv files in ./dataset/ManyHarm
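Before launching a run, it can help to confirm that the dataset files are where the scripts will look for them. The snippet below is a minimal sketch of such a check, not part of this repository: the helper name `list_manyharm_csvs` is hypothetical, and it only assumes the `./dataset/ManyHarm` location and the `.csv` extension stated above.

```python
from pathlib import Path


def list_manyharm_csvs(root="./dataset/ManyHarm"):
    """Return the sorted .csv files directly under `root`.

    Hypothetical helper: raises FileNotFoundError if the directory is
    missing or contains no .csv files.
    """
    directory = Path(root)
    if not directory.is_dir():
        raise FileNotFoundError(f"dataset directory not found: {directory}")
    # Only files ending in .csv are considered; other files are ignored.
    csv_files = sorted(directory.glob("*.csv"))
    if not csv_files:
        raise FileNotFoundError(f"no .csv files found in {directory}")
    return csv_files
```

Calling `list_manyharm_csvs()` from the repository root returns the file paths if the download was placed correctly and raises a descriptive error otherwise.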
We use HuggingFace-compatible model weights such as Meta-Llama-3.1-8B. Ensure your model supports chat templates and correct role names.
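One way to check the chat-template requirement is to render a short, benign conversation with the model's tokenizer and inspect the result. The following is a minimal sketch under stated assumptions: the helper `render_probe_chat` is hypothetical and not part of this repository, and the role names `user` and `assistant` are an assumption that may need adjusting for your model. It relies only on the `chat_template` attribute and the `apply_chat_template` method that Hugging Face tokenizers provide.

```python
def render_probe_chat(tokenizer, roles=("user", "assistant")):
    """Render a short benign conversation with the tokenizer's chat template.

    Hypothetical helper: `roles` is an assumed (user, assistant) naming.
    Raises ValueError if the tokenizer has no chat template.
    """
    if not getattr(tokenizer, "chat_template", None):
        raise ValueError("tokenizer does not define a chat template")
    user_role, assistant_role = roles
    messages = [
        {"role": user_role, "content": "Hello."},
        {"role": assistant_role, "content": "Hi, how can I help?"},
        {"role": user_role, "content": "What is 2 + 2?"},
    ]
    # tokenize=False returns the formatted prompt string rather than token ids.
    return tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )


# Example usage (requires the `transformers` package and local weights):
# from transformers import AutoTokenizer
# tok = AutoTokenizer.from_pretrained("<path to the model weights>")
# print(render_probe_chat(tok))
```

If the template rejects a role name, `apply_chat_template` will typically raise an error, which points to a mismatch between the model and the roles being used.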
Run the many-shot jailbreaking baseline (all three PANDAS techniques off):
python3 main_jailbreak.py --dataset harmbench -m Meta-Llama-3.1-8B-Instruct -c all \
--positive_affirmation off --negative_demonstration off --adaptive_sampling off \
--seed 0 --num_restart 3 --max_shot 256 -d <path to the result directory>
Run PANDAS (all three techniques enabled):
python3 main_jailbreak.py --dataset harmbench -m Meta-Llama-3.1-8B-Instruct -c all \
--positive_affirmation random --negative_demonstration first --adaptive_sampling percentage_of_1 \
--seed 0 --num_restart 3 --max_shot 256 -d <path to the result directory>
Run the Bayesian optimization script:
python3 main_bayes_opt.py -c abuse-platform -s 64 -n 5 -i 50 --dataset advbench -m Meta-Llama-3.1-8B-Instruct \
--init_points 10 --debug 0 --using_logit 0 --char 1 -d <path to the result directory>
If you find this useful in your research, please consider citing:
@inproceedings{ma2025pandas,
title={{PANDAS}: Improving Many-shot Jailbreaking via Positive Affirmation, Negative Demonstration, and Adaptive Sampling},
author={Ma, Avery and Pan, Yangchen and Farahmand, Amir-massoud},
booktitle={Proceedings of the International Conference on Machine Learning (ICML)},
year={2025},
}
MIT License