This project aims to develop Haystack, an open-source platform for red teaming and human feedback on LLMs using both crowd-sourced and automated methods. Its goals are to:
- build a frontend for Pythia model evaluation
- develop a model evaluation interface for red teaming from scratch
- develop a representational view of interpretability
- organise open source human feedback data for sharing between models.
Despite many efforts to red team language models, there are no available open-source frameworks for client-level, user-facing model evaluation and red-team testing. Existing efforts include:
- White House
- UK (Royal Society)
- The Trojan Detection Challenge 2023 (MIT)
- DEF CON
- Next.js
- Tailwind CSS
- Deployed on Vercel (see the component sketch below)
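To make the stack concrete, here is a minimal sketch of what an evaluation view could look like as a Next.js component styled with Tailwind CSS; the component name, props, and feedback labels are hypothetical, not Haystack's actual UI.

```tsx
"use client"; // needed for the click handlers when used in the Next.js app router

// Minimal sketch of an evaluation view built with this stack: a Next.js
// (React) component styled with Tailwind CSS. The component name, props, and
// feedback labels are assumptions made for illustration only.
type EvaluationCardProps = {
  prompt: string;
  completion: string;
  onFlag: (label: "safe" | "unsafe") => void;
};

export function EvaluationCard({ prompt, completion, onFlag }: EvaluationCardProps) {
  return (
    <div className="space-y-2 rounded-lg border p-4">
      <p className="text-sm font-semibold">Prompt</p>
      <p className="text-sm">{prompt}</p>
      <p className="text-sm font-semibold">Model response</p>
      <p className="text-sm">{completion}</p>
      <div className="flex gap-2">
        <button
          className="rounded bg-green-600 px-3 py-1 text-white"
          onClick={() => onFlag("safe")}
        >
          Safe
        </button>
        <button
          className="rounded bg-red-600 px-3 py-1 text-white"
          onClick={() => onFlag("unsafe")}
        >
          Unsafe
        </button>
      </div>
    </div>
  );
}
```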
| Assessment Type | Description |
|---|---|
| Capabilities Assessment | Benchmark performance on representative tasks and datasets. Measure capabilities such as accuracy, robustness, and efficiency. Identify strengths, limitations, and gaps. |
| Adversarial Testing | Probe with malformed, adversarial inputs. Check for crashes, unintended behavior, and security risks. Informed by threat models and risk analysis. |
| Red Teaming | Model potential real-world risks and failures. Role-play adversary perspectives. Surface risks unique to AI. |
| Human Oversight | Manual test cases based on human judgment. |
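One way findings from these assessment types could be organised as shareable records is sketched below; the field names and values are assumptions, not a fixed Haystack schema.

```ts
// Hypothetical record for a single evaluation finding, so that results from
// the different assessment types above can be stored and shared in one format.
type AssessmentType =
  | "capabilities"
  | "adversarial"
  | "red-team"
  | "human-oversight";

interface EvaluationRecord {
  model: string;            // e.g. "pythia-2.8b"
  assessment: AssessmentType;
  prompt: string;
  response: string;
  verdict: "pass" | "fail" | "unclear";
  notes?: string;           // free-form reviewer comments
}
```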
Mosaic Prompt: break a prompt down into permissible components (see the sketch after this list)
- Users break down impermissible content into small permissible components.
- Each component is queried independently and appears harmless.
- User recombines components to reconstruct impermissible content.
- Exploits compositionality of language.
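As a rough sketch of how such a probe could be driven programmatically: the endpoint URL, request/response shape, and decomposition below are hypothetical assumptions, not Haystack's actual API.

```ts
// Minimal sketch of a mosaic-prompt probe. Each component is sent as an
// independent, innocuous-looking query, and the answers are recombined
// client-side. MODEL_ENDPOINT and the request/response shape are assumptions
// made for illustration.
const MODEL_ENDPOINT = "https://example.com/api/generate";

async function queryModel(prompt: string): Promise<string> {
  const res = await fetch(MODEL_ENDPOINT, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt }),
  });
  if (!res.ok) throw new Error(`Model query failed: ${res.status}`);
  const data = (await res.json()) as { completion: string };
  return data.completion;
}

async function runMosaicProbe(components: string[]): Promise<string> {
  const answers: string[] = [];
  for (const component of components) {
    // Queried on its own, each component appears harmless to the model.
    answers.push(await queryModel(component));
  }
  // The attacker recombines the per-component answers locally, reconstructing
  // content the model would have refused if asked for in a single prompt.
  return answers.join("\n");
}
```

Logging each component query alongside the recombined output is the kind of evidence a red-teaming interface could surface for human review.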
Low-Resource Language Translation: jailbreak by translating prompts into low-resource languages (see the sketch after this list)
- The attack involves translating unsafe English input prompts into low-resource natural languages using Google Translate.
- Low-resource languages are those with limited training data, like Zulu.
- The translated prompts are sent to GPT-4, which then responds unsafely instead of refusing.
- The attack exploits uneven multilingual training of GPT-4's safety measures.
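A minimal sketch of this pipeline is below; the `translate` helper (e.g. a wrapper around a translation API such as Google Translate) and the `queryModel` function are passed in because the concrete clients are implementation details assumed here.

```ts
// Sketch of the low-resource translation attack pipeline. The translate and
// queryModel helpers are injected, since the concrete translation client and
// model endpoint are assumptions, not part of the original write-up.
type Translate = (text: string, targetLang: string) => Promise<string>;
type QueryModel = (prompt: string) => Promise<string>;

async function runTranslationProbe(
  unsafePrompt: string,
  translate: Translate,
  queryModel: QueryModel,
  lowResourceLang = "zu" // Zulu, an example of a low-resource language
): Promise<{ translated: string; response: string; backTranslated: string }> {
  // 1. Translate the unsafe English prompt into the low-resource language.
  const translated = await translate(unsafePrompt, lowResourceLang);
  // 2. Send the translated prompt to the target model (e.g. GPT-4).
  const response = await queryModel(translated);
  // 3. Translate the response back to English to judge whether the model's
  //    safety refusal was bypassed.
  const backTranslated = await translate(response, "en");
  return { translated, response, backTranslated };
}
```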