equiano-institute/haystack

Red Teaming LLMs

This project aims to develop Haystack, an open-source platform for red teaming and collecting human feedback on LLMs, using both crowdsourced and automated methods.

Goals

Challenges & Efforts

Despite many efforts to red team language models, no open-source frameworks are currently available for client-level user model evaluation and red teaming.


Built with

Red Teaming

Assessment Type | Description
Capabilities Assessment | Benchmark performance on representative tasks and datasets. Measure capabilities like accuracy, robustness, efficiency. Identify strengths, limitations, and gaps.
Adversarial Testing | Probe with malformed, adversarial inputs. Check for crashes, unintended behavior, security risks. Informed by threat models, risk analysis.
Red Teaming | Model potential real-world risks and failures. Role play adversary perspectives. Surface risks unique to AI.
Human Oversight | Manual test cases based on human judgment.
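
The adversarial testing row can be illustrated with a minimal sketch of a probe runner that feeds malformed inputs to a model and records crashes. This is not Haystack's API: `run_adversarial_probes`, `ProbeResult`, `toy_model`, and the probe list are all hypothetical names introduced here, and `toy_model` is a stand-in for a real LLM client.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ProbeResult:
    prompt: str
    crashed: bool
    output: str  # model output, or the exception text if the call crashed


def run_adversarial_probes(
    model: Callable[[str], str], probes: List[str]
) -> List[ProbeResult]:
    """Send each probe to the model and record crashes instead of aborting."""
    results = []
    for prompt in probes:
        try:
            results.append(ProbeResult(prompt, False, model(prompt)))
        except Exception as exc:  # any failure counts as a finding
            results.append(ProbeResult(prompt, True, repr(exc)))
    return results


def toy_model(prompt: str) -> str:
    # Hypothetical stand-in for a real model call; it fails on blank input.
    if not prompt.strip():
        raise ValueError("empty prompt")
    return prompt[:20].upper()


# Assumed examples of malformed inputs: empty, null bytes, oversized, normal.
MALFORMED_PROBES = ["", "\x00\x00", "A" * 10_000, "normal question"]

results = run_adversarial_probes(toy_model, MALFORMED_PROBES)
crash_count = sum(r.crashed for r in results)
print(f"{crash_count} of {len(results)} probes crashed the model")
```

Recording failures rather than raising keeps one crashing probe from hiding the results of the rest, which matters once the probe set grows.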

Attack Examples

Mosaic Prompt: break down a prompt into permissible components

  • Users break down impermissible content into small permissible components.
  • Each component is queried independently and appears harmless.
  • Users then recombine the components to reconstruct the impermissible content.
  • Exploits compositionality of language.
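
From an evaluator's side, the steps above suggest comparing a per-query check with a check over the whole session. The sketch below uses a toy policy and harmless placeholder terms; `BLOCKED_COMBINATION`, `is_permissible`, `per_query_check`, and `session_check` are hypothetical names, and a real policy would be far more involved than a word-set match.

```python
from typing import List, Set

# Hypothetical policy: content is impermissible only when all of these
# placeholder terms appear together.
BLOCKED_COMBINATION: Set[str] = {"alpha", "beta", "gamma"}


def is_permissible(text: str) -> bool:
    """Toy filter: reject text containing every blocked term at once."""
    words = set(text.lower().split())
    return not BLOCKED_COMBINATION.issubset(words)


def per_query_check(queries: List[str]) -> List[bool]:
    """Evaluate each query in isolation, as a stateless filter would."""
    return [is_permissible(q) for q in queries]


def session_check(queries: List[str]) -> bool:
    """Evaluate the concatenated session, approximating recombination."""
    return is_permissible(" ".join(queries))


components = ["explain alpha", "explain beta", "explain gamma"]
per_query = per_query_check(components)
session_ok = session_check(components)
print(per_query, session_ok)
```

Each component passes on its own while the combined session does not, which is the gap a mosaic prompt relies on; a red teaming run could log both results to measure how often they disagree.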

Cross-Lingual Attacks: translating between high- and low-resource languages to attack multilingual capability

  • The attack involves translating unsafe English input prompts into low-resource natural languages using Google Translate.
  • Low-resource languages are those with limited training data, like Zulu.
  • The translated prompts are sent to GPT-4, which then responds unsafely instead of refusing.
  • The attack exploits the uneven multilingual coverage of GPT-4's safety training.
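
One way to measure this effect is to compare refusal rates for the same prompts across languages. The following is a minimal sketch under stated assumptions: `toy_translate` and `toy_model` are hypothetical stubs (no real translation service or GPT-4 call is made), the prompts are harmless placeholders, and the keyword-based `is_refusal` check is a crude assumption that a real evaluation would replace with human or model-based grading.

```python
from typing import Callable, Dict, List

# Assumed refusal phrases; a crude heuristic for illustration only.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't")


def is_refusal(response: str) -> bool:
    lowered = response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def refusal_rate_by_language(
    model: Callable[[str], str],
    translate: Callable[[str, str], str],
    prompts: List[str],
    languages: List[str],
) -> Dict[str, float]:
    """Return the fraction of prompts the model refuses in each language."""
    rates = {}
    for lang in languages:
        refusals = sum(is_refusal(model(translate(p, lang))) for p in prompts)
        rates[lang] = refusals / len(prompts)
    return rates


def toy_translate(prompt: str, lang: str) -> str:
    # Hypothetical stub: identity for English, scrambled text otherwise.
    return prompt if lang == "en" else f"[{lang}] {prompt[::-1]}"


def toy_model(prompt: str) -> str:
    # Hypothetical stub whose safety check only recognizes English text.
    if "placeholder-unsafe" in prompt:
        return "I cannot help with that."
    return "Sure, here is a response."


prompts = ["placeholder-unsafe request 1", "placeholder-unsafe request 2"]
rates = refusal_rate_by_language(toy_model, toy_translate, prompts, ["en", "zu"])
print(rates)
```

A large gap between the English rate and the rate for a low-resource language such as Zulu (`zu`) would indicate that refusals do not transfer across languages for the model under test.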

About

A suite of red teaming and evaluation frameworks for language models
