diff --git a/2024/2024_01_08_FLIRT_Feedback_Loop_In-context_Red_Teaming/FLIRT.pdf b/2024/2024_01_08_FLIRT_Feedback_Loop_In-context_Red_Teaming/FLIRT.pdf
new file mode 100644
index 0000000..da6598e
Binary files /dev/null and b/2024/2024_01_08_FLIRT_Feedback_Loop_In-context_Red_Teaming/FLIRT.pdf differ
diff --git a/2024/2024_01_08_FLIRT_Feedback_Loop_In-context_Red_Teaming/README.md b/2024/2024_01_08_FLIRT_Feedback_Loop_In-context_Red_Teaming/README.md
new file mode 100644
index 0000000..c4e550d
--- /dev/null
+++ b/2024/2024_01_08_FLIRT_Feedback_Loop_In-context_Red_Teaming/README.md
@@ -0,0 +1,5 @@
+# FLIRT: Feedback Loop In-context Red Teaming
+
+As generative models become available for public use in various applications, testing and analyzing vulnerabilities of these models has become a priority. In this paper, the authors propose FLIRT: an automatic red teaming framework that evaluates a given model and exposes its vulnerabilities against unsafe and inappropriate content generation. Contrary to the currently available solutions, FLIRT fully automates the feedback loop responsible for offensive content generation and does not follow a human-in-the-loop approach. The experiments demonstrate that compared to baseline approaches, the proposed strategy is significantly more effective in exposing vulnerabilities in the Stable Diffusion (SD) model, even when the latter is enhanced with safety features.
+
+The presentation is based on this [paper](https://openreview.net/forum?id=JTBe1WG3Ws)
diff --git a/README.md b/README.md
index b1603cb..3682e90 100644
--- a/README.md
+++ b/README.md
@@ -21,7 +21,7 @@ Join us at https://meet.drwhy.ai.
 * 04.12.2023 - Discussion - RedTeaming of foundation models
 * 11.12.2023 - [Introduction to Diffusion Models](https://github.com/MI2DataLab/MI2DataLab_Seminarium/tree/master/2023/2023_12_11_intro_to_diffusion_models) - Bartek Sobieski
 * 18.12.2023 - [Glaze: Protecting artists from style mimicry by text-to-image model](https://github.com/MI2DataLab/MI2DataLab_Seminarium/tree/master/2023/2023_12_18_glaze_protecting_artists_from_style_mimicry) - Tymoteusz Kwieciński
-* 08.01.2024 - TBD - Hubert Ruczyński
+* 08.01.2024 - [FLIRT: Feedback Loop In-context Red Teaming](https://github.com/HubertR21/MI2DataLab_Seminarium/tree/patch-2/2024/2024_01_08_FLIRT_Feedback_Loop_In-context_Red_Teaming) - Hubert Ruczyński
 * 15.01.2024 - [Red-Teaming the Stable Diffusion Safety Filter](https://github.com/MI2DataLab/MI2DataLab_Seminarium/tree/master/2023/2024_01_15_red_teaming_stable_diffusion_safety_filter) - Mateusz Grzyb
 * 22.01.2024 - Discussion - Diffusion models for XAI