
A curated list of safety-related papers, articles, and resources focused on Large Language Models (LLMs). This repository aims to provide researchers, practitioners, and enthusiasts with insights into the safety implications, challenges, and advancements surrounding these powerful models.


ydyjya/Awesome-LLM-Safety


🛡️Awesome LLM-Safety🛡️


English | 中文

🤗Introduction

Welcome to our Awesome-llm-safety repository! 🥰🥰🥰

🔥 News

  • 2024.05: updated the NAACL 2024 paper collection, thanks to @zhrli324 and @feqHe!

🧑‍💻 Our Work

We've curated a collection of the latest 😋, most comprehensive 😎, and most valuable 🤩 resources on large language model safety (llm-safety). We also include relevant talks, tutorials, conferences, news, and articles. Our repository is updated regularly so that you have the most current information at your fingertips.

If a resource is relevant to multiple subcategories, we place it under each applicable section. For instance, the "Awesome-LLM-Safety" repository is listed under every subcategory to which it pertains 🤩!

✔️ Perfect for Most Readers

  • For beginners curious about llm-safety, our repository serves as a compass for grasping the big picture and diving into the details. Classic or influential papers retained in the README provide a beginner-friendly navigation through interesting directions in the field;
  • For seasoned researchers, this repository is a tool to keep you informed and fill any gaps in your knowledge. Within each subtopic, we are diligently updating all the latest content and continuously backfilling with previous work. Our thorough compilation and careful selection are time-savers for you.

🧭 How to Use this Guide

  • Quick Start: In the README, you can find a curated selection of resources sorted by date, along with links to further material.
  • In-Depth Exploration: If you have a special interest in a particular subtopic, delve into the "subtopic" folder for more. Each item, be it an article or piece of news, comes with a brief introduction, allowing researchers to swiftly zero in on relevant content.

💼 How to Contribute

If you have completed an insightful piece of work or carefully compiled a set of conference papers, we would love to add it to the repository.

  • For individual papers, you can raise an issue, and we will quickly add your paper under the corresponding subtopic.
  • If you have compiled a collection of papers for a conference, you are welcome to submit a pull request directly. We would greatly appreciate your contribution. Please note that these pull requests need to be consistent with our existing format.

📜Advertisement

🌱 If you would like more people to read your recent insightful work, please contact me via email. I can offer you a promotional spot here for up to one month.

Let’s start the LLM Safety tutorial!


🚀Table of Contents


🤔AI Safety & Security Discussions

| Date | Link | Authors | Publication |
| --- | --- | --- | --- |
| 2024/5/20 | Managing extreme AI risks amid rapid progress | Yoshua Bengio, Geoffrey Hinton, Andrew Yao, Dawn Song, Pieter Abbeel, Trevor Darrell, Yuval Noah Harari, Ya-Qin Zhang, Lan Xue, Shai Shalev-Shwartz, Gillian Hadfield, Jeff Clune, Tegan Maharaj, Frank Hutter, Atılım Güneş Baydin, Sheila McIlraith, Qiqi Gao, Ashwin Acharya, David Krueger, Anca Dragan, Philip Torr, Stuart Russell, Daniel Kahneman, Jan Brauner, Sören Mindermann | Science |

🔐Security & Discussion

📑Papers

| Date | Institute | Publication | Paper |
| --- | --- | --- | --- |
| 20.10 | Facebook AI Research | arxiv | Recipes for Safety in Open-domain Chatbots |
| 22.03 | OpenAI | NIPS2022 | Training language models to follow instructions with human feedback |
| 23.07 | UC Berkeley | NIPS2023 | Jailbroken: How Does LLM Safety Training Fail? |
| 23.12 | OpenAI | Open AI | Practices for Governing Agentic AI Systems |

📖Tutorials, Articles, Presentations and Talks

| Date | Type | Title | URL |
| --- | --- | --- | --- |
| 22.02 | Toxicity Detection API | Perspective API | link, paper |
| 23.07 | Repository | Awesome LLM Security | link |
| 23.10 | Tutorials | Awesome-LLM-Safety | link |
| 24.01 | Tutorials | Awesome-LM-SSP | link |

Other

👉Latest&Comprehensive Security Paper


🔏Privacy

📑Papers

| Date | Institute | Publication | Paper |
| --- | --- | --- | --- |
| 19.12 | Microsoft | CCS2020 | Analyzing Information Leakage of Updates to Natural Language Models |
| 21.07 | Google Research | ACL2022 | Deduplicating Training Data Makes Language Models Better |
| 21.10 | Stanford | ICLR2022 | Large Language Models Can Be Strong Differentially Private Learners |
| 22.02 | Google Research | ICLR2023 | Quantifying Memorization Across Neural Language Models |
| 22.02 | UNC Chapel Hill | ICML2022 | Deduplicating Training Data Mitigates Privacy Risks in Language Models |

📖Tutorials, Articles, Presentations and Talks

| Date | Type | Title | URL |
| --- | --- | --- | --- |
| 23.10 | Tutorials | Awesome-LLM-Safety | link |
| 24.01 | Tutorials | Awesome-LM-SSP | link |

Other

👉Latest&Comprehensive Privacy Paper


📰Truthfulness & Misinformation

📑Papers

| Date | Institute | Publication | Paper |
| --- | --- | --- | --- |
| 21.09 | University of Oxford | ACL2022 | TruthfulQA: Measuring How Models Mimic Human Falsehoods |
| 23.11 | Harbin Institute of Technology | arxiv | A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions |
| 23.11 | Arizona State University | arxiv | Can Knowledge Graphs Reduce Hallucinations in LLMs? : A Survey |

📖Tutorials, Articles, Presentations and Talks

| Date | Type | Title | URL |
| --- | --- | --- | --- |
| 23.07 | Repository | llm-hallucination-survey | link |
| 23.10 | Repository | LLM-Factuality-Survey | link |
| 23.10 | Tutorials | Awesome-LLM-Safety | link |

Other

👉Latest&Comprehensive Truthfulness&Misinformation Paper


😈JailBreak & Attacks

📑Papers

| Date | Institute | Publication | Paper |
| --- | --- | --- | --- |
| 20.12 | Google | USENIX Security 2021 | Extracting Training Data from Large Language Models |
| 22.11 | AE Studio | NIPS2022(ML Safety Workshop) | Ignore Previous Prompt: Attack Techniques For Language Models |
| 23.06 | Google | arxiv | Are aligned neural networks adversarially aligned? |
| 23.07 | CMU | arxiv | Universal and Transferable Adversarial Attacks on Aligned Language Models |
| 23.10 | University of Pennsylvania | arxiv | Jailbreaking Black Box Large Language Models in Twenty Queries |

📖Tutorials, Articles, Presentations and Talks

| Date | Type | Title | URL |
| --- | --- | --- | --- |
| 23.01 | Community | Reddit/ChatGPTJailbreak | link |
| 23.02 | Resource&Tutorials | Latest Jailbreak Prompts | link |
| 23.10 | Tutorials | Awesome-LLM-Safety | link |
| 23.10 | Article | Adversarial Attacks on LLMs (Author: Lilian Weng) | link |
| 23.11 | Video | [1hr Talk] Intro to Large Language Models, from 45:45 (Author: Andrej Karpathy) | link |
| 24.09 | Repo | awesome_LLM-harmful-fine-tuning-papers | link |
| 12.10 | Resource | Jailbreak Communities | link |
| 12.10 | Article | Jailbreak Techniques and Safeguards | link |

Other

👉Latest&Comprehensive JailBreak & Attacks Paper


🛡️Defenses & Mitigation

📑Papers

| Date | Institute | Publication | Paper |
| --- | --- | --- | --- |
| 21.07 | Google Research | ACL2022 | Deduplicating Training Data Makes Language Models Better |
| 22.04 | Anthropic | arxiv | Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback |

📖Tutorials, Articles, Presentations and Talks

| Date | Type | Title | URL |
| --- | --- | --- | --- |
| 23.10 | Tutorials | Awesome-LLM-Safety | link |

Other

👉Latest&Comprehensive Defenses Paper


💯Datasets & Benchmark

📑Papers

| Date | Institute | Publication | Paper |
| --- | --- | --- | --- |
| 20.09 | University of Washington | EMNLP2020(findings) | RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models |
| 21.09 | University of Oxford | ACL2022 | TruthfulQA: Measuring How Models Mimic Human Falsehoods |
| 22.03 | MIT | ACL2022 | ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection |

📖Tutorials, Articles, Presentations and Talks

| Date | Type | Title | URL |
| --- | --- | --- | --- |
| 23.10 | Tutorials | Awesome-LLM-Safety | link |

📚Resource📚

Other

👉Latest&Comprehensive datasets & Benchmark Paper


🧑‍🎓Author

🤗If you have any questions, please contact our authors!🤗

✉️: ydyjya ➡️ [email protected]

💬: LLM Safety Discussion

