git-disl/awesome_LLM-harmful-fine-tuning-papers

Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey

🔥 Must-read papers for harmful fine-tuning attacks/defenses for LLMs.

💫 Continuously updated on a weekly basis. (last update: 2024/12/26)

🔥 Good news: 7 papers related to harmful fine-tuning were accepted by NeurIPS2024.

💫 We have updated our survey, which now includes a discussion of the 17 new ICLR2025 submissions.

🔥 We have updated the slides that introduce harmful fine-tuning attacks/defenses. Check out the slides here.

Content

Attacks

  • [2023/10/4] Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models arXiv [paper] [code]

  • [2023/10/5] Fine-tuning aligned language models compromises safety, even when users do not intend to! ICLR 2024 [paper] [code]

  • [2023/10/5] On the Vulnerability of Safety Alignment in Open-Access LLMs ACL2024 (Findings) [paper]

  • [2023/10/31] LoRA fine-tuning efficiently undoes safety training in Llama 2-Chat 70B SeT LLM workshop @ ICLR 2024 [paper]

  • [2023/11/9] Removing RLHF Protections in GPT-4 via Fine-Tuning NAACL2024 [paper]

  • [2024/4/1] What's in your "safe" data?: Identifying benign data that breaks safety COLM2024 [paper] [code]

  • [2024/6/28] Covert malicious finetuning: Challenges in safeguarding LLM adaptation ICML2024 [paper]

  • [2024/7/29] Can Editing LLMs Inject Harm? NeurIPS2024 [paper] [code]

  • [2024/10/21] The effect of fine-tuning on language model toxicity NeurIPS2024 Safe GenAI workshop [paper]

  • [2024/10/23] Towards Understanding the Fragility of Multilingual LLMs against Fine-Tuning Attacks arXiv [paper]

Defenses

Alignment Stage Defenses

  • [2024/2/2] Vaccine: Perturbation-aware alignment for large language model against harmful fine-tuning NeurIPS2024 [paper] [code]

  • [2024/5/23] Representation noising effectively prevents harmful fine-tuning on LLMs NeurIPS2024 [paper] [code]

  • [2024/5/24] Buckle Up: Robustifying LLMs at Every Customization Stage via Data Curation ICLR2025 Submission [paper] [code] [Openreview]

  • [2024/8/1] Tamper-Resistant Safeguards for Open-Weight LLMs ICLR2025 Submission [Openreview] [paper] [code]

  • [2024/9/3] Booster: Tackling harmful fine-tuning for large language models via attenuating harmful perturbation ICLR2025 Submission [paper] [code] [Openreview]

  • [2024/9/26] Leveraging Catastrophic Forgetting to Develop Safe Diffusion Models against Malicious Finetuning NeurIPS2024 (for diffusion model) [paper]

  • [2024/10/05] Identifying and Tuning Safety Neurons in Large Language Models ICLR2025 Submission [Openreview]

  • [2024/10/13] Targeted Vaccine: Safety Alignment for Large Language Models against Harmful Fine-Tuning via Layer-wise Perturbation arXiv [paper] [code]

Fine-tuning Stage Defenses

  • [2023/8/25] Fine-tuning can cripple your foundation model; preserving features may be the solution TMLR [paper] [code]

  • [2023/9/14] Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions ICLR2024 [paper] [code]

  • [2024/2/3] Safety fine-tuning at (almost) no cost: A baseline for vision large language models ICML2024 [paper] [code]

  • [2024/2/7] Assessing the brittleness of safety alignment via pruning and low-rank modifications ME-FoMo@ICLR2024 [paper] [code]

  • [2024/2/22] Mitigating fine-tuning jailbreak attack with backdoor enhanced alignment NeurIPS2024 [paper] [code]

  • [2024/2/28] Keeping LLMs aligned after fine-tuning: The crucial role of prompt templates NeurIPS2024 [paper] [code]

  • [2024/5/28] Lazy safety alignment for large language models against harmful fine-tuning NeurIPS2024 [paper] [code]

  • [2024/6/10] Safety alignment should be made more than just a few tokens deep ICLR2025 Submission [paper] [code] [Openreview]

  • [2024/6/12] Do as I do (Safely): Mitigating Task-Specific Fine-tuning Risks in Large Language Models ICLR2025 Submission [paper] [Openreview]

  • [2024/8/27] Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models ICLR2025 Submission [Openreview] [paper]

  • [2024/8/30] Safety Layers in Aligned Large Language Models: The Key to LLM Security ICLR2025 Submission [Openreview] [paper]

  • [2024/10/05] SEAL: Safety-enhanced Aligned LLM Fine-tuning via Bilevel Data Selection ICLR2025 Submission [Openreview]

  • [2024/10/05] Safety Alignment Shouldn't Be Complicated ICLR2025 Submission [Openreview]

  • [2024/10/05] SaLoRA: Safety-Alignment Preserved Low-Rank Adaptation ICLR2025 Submission [Openreview]

  • [2024/10/05] Towards Secure Tuning: Mitigating Security Risks Arising from Benign Instruction Fine-Tuning ICLR2025 Submission [paper] [Openreview]

  • [2024/10/13] Safety-Aware Fine-Tuning of Large Language Models NeurIPS 2024 Workshop on Safe Generative AI [paper]

  • [2024/12/19] RobustFT: Robust Supervised Fine-tuning for Large Language Models under Noisy Response arXiv [paper]

Post-Fine-tuning Stage Defenses

  • [2024/3/8] Defending Against Unforeseen Failure Modes with Latent Adversarial Training arXiv [paper] [code]

  • [2024/5/15] A safety realignment framework via subspace-oriented model fusion for large language models KBS [paper] [code]

  • [2024/5/23] MoGU: A Framework for Enhancing Safety of Open-Sourced LLMs While Preserving Their Usability NeurIPS2024 [paper] [code]

  • [2024/5/27] Safe LoRA: the silver lining of reducing safety risks when fine-tuning large language models NeurIPS2024 [paper]

  • [2024/8/18] Antidote: Post-fine-tuning safety alignment for large language models against harmful fine-tuning arXiv [paper]

  • [2024/10/05] Locking Down the Finetuned LLMs Safety ICLR2025 Submission [Openreview]

  • [2024/10/05] Unraveling and Mitigating Safety Alignment Degradation of Vision-Language Models ICLR2025 Submission [Openreview]

  • [2024/12/15] Separate the Wheat from the Chaff: A Post-Hoc Approach to Safety Re-Alignment for Fine-Tuned Language Models arXiv [paper]

  • [2024/12/17] NLSR: Neuron-Level Safety Realignment of Large Language Models Against Harmful Fine-Tuning AAAI2025 [paper] [code]

Mechanistic Study

  • [2024/5/25] No two devils alike: Unveiling distinct mechanisms of fine-tuning attacks arXiv [paper]
  • [2024/5/27] Navigating the safety landscape: Measuring risks in finetuning large language models NeurIPS2024 [paper]
  • [2024/10/05] Your Task May Vary: A Systematic Understanding of Alignment and Safety Degradation when Fine-tuning LLMs ICLR2025 Submission [Openreview]
  • [2024/10/05] On Evaluating the Durability of Safeguards for Open-Weight LLMs ICLR2025 Submission [Openreview] [code]
  • [2024/11/13] The VLLM Safety Paradox: Dual Ease in Jailbreak Attack and Defense arXiv [paper]

Benchmark

  • [2024/9/19] Defending against Reverse Preference Attacks is Difficult arXiv [paper] [code]

Attacks and Defenses for Federated Fine-tuning

  • [2024/6/15] Emerging Safety Attack and Defense in Federated Instruction Tuning of Large Language Models ICLR2025 Submission [paper] [Openreview]
  • [2024/11/28] PEFT-as-an-Attack! Jailbreaking Language Models during Federated Parameter-Efficient Fine-Tuning arXiv [paper]

Other awesome resources on LLM safety

Citation

If you find this repository useful, please cite our paper:

@article{huang2024harmful,
  title={Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey},
  author={Huang, Tiansheng and Hu, Sihao and Ilhan, Fatih and Tekin, Selim Furkan and Liu, Ling},
  journal={arXiv preprint arXiv:2409.18169},
  year={2024}
}

Contact

If you discover any papers that are suitable but not included, please contact Tiansheng Huang ([email protected]).
