This project aims to develop Haystack, an open-source platform for red teaming and human feedback on LLMs using both crowd-sourced and automated methods. Its goals are to:
- build a frontend for Pythia model evaluation
- develop a model evaluation interface for red teaming from scratch
- develop a representational view of interpretability
- organise open source human feedback data for sharing between models.
Despite many efforts to red team language models, there are no available open-source frameworks for client-level, user-facing model evaluation and red-team testing. Existing efforts include:
- White House
- UK (Royal Society)
- The Trojan Detection Challenge 2023 (MIT)
- DEF CON
- Next.js
- Tailwind CSS
- Deployed on Vercel (see the component sketch below)
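To make the stack concrete, here is a minimal sketch of what an evaluation view could look like as a Next.js component styled with Tailwind CSS; the component name, props, and feedback labels are hypothetical, not Haystack's actual UI.

```tsx
"use client"; // needed for the click handlers when used in the Next.js app router

// Minimal sketch of an evaluation view built with this stack: a Next.js
// (React) component styled with Tailwind CSS. The component name, props, and
// feedback labels are assumptions made for illustration only.
type EvaluationCardProps = {
  prompt: string;
  completion: string;
  onFlag: (label: "safe" | "unsafe") => void;
};

export function EvaluationCard({ prompt, completion, onFlag }: EvaluationCardProps) {
  return (
    <div className="space-y-2 rounded-lg border p-4">
      <p className="text-sm font-semibold">Prompt</p>
      <p className="text-sm">{prompt}</p>
      <p className="text-sm font-semibold">Model response</p>
      <p className="text-sm">{completion}</p>
      <div className="flex gap-2">
        <button
          className="rounded bg-green-600 px-3 py-1 text-white"
          onClick={() => onFlag("safe")}
        >
          Safe
        </button>
        <button
          className="rounded bg-red-600 px-3 py-1 text-white"
          onClick={() => onFlag("unsafe")}
        >
          Unsafe
        </button>
      </div>
    </div>
  );
}
```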
| Assessment Type | Description |
|---|---|
| Capabilities Assessment | Benchmark performance on representative tasks and datasets. Measure capabilities such as accuracy, robustness, and efficiency. Identify strengths, limitations, and gaps. |
| Adversarial Testing | Probe with malformed, adversarial inputs. Check for crashes, unintended behavior, and security risks. Informed by threat models and risk analysis. |
| Red Teaming | Model potential real-world risks and failures. Role-play adversary perspectives. Surface risks unique to AI. |
| Human Oversight | Manual test cases based on human judgment. |
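One way findings from these assessment types could be organised as shareable records is sketched below; the field names and values are assumptions, not a fixed Haystack schema.

```ts
// Hypothetical record for a single evaluation finding, so that results from
// the different assessment types above can be stored and shared in one format.
type AssessmentType =
  | "capabilities"
  | "adversarial"
  | "red-team"
  | "human-oversight";

interface EvaluationRecord {
  model: string;            // e.g. "pythia-2.8b"
  assessment: AssessmentType;
  prompt: string;
  response: string;
  verdict: "pass" | "fail" | "unclear";
  notes?: string;           // free-form reviewer comments
}
```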
Mosaic Prompt: break a prompt down into permissible components (see the sketch after this list)
- Users break down impermissible content into small permissible components.
- Each component is queried independently and appears harmless.
- User recombines components to reconstruct impermissible content.
- Exploits compositionality of language.
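As a rough sketch of how such a probe could be driven programmatically: the endpoint URL, request/response shape, and decomposition below are hypothetical assumptions, not Haystack's actual API.

```ts
// Minimal sketch of a mosaic-prompt probe. Each component is sent as an
// independent, innocuous-looking query, and the answers are recombined
// client-side. MODEL_ENDPOINT and the request/response shape are assumptions
// made for illustration.
const MODEL_ENDPOINT = "https://example.com/api/generate";

async function queryModel(prompt: string): Promise<string> {
  const res = await fetch(MODEL_ENDPOINT, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt }),
  });
  if (!res.ok) throw new Error(`Model query failed: ${res.status}`);
  const data = (await res.json()) as { completion: string };
  return data.completion;
}

async function runMosaicProbe(components: string[]): Promise<string> {
  const answers: string[] = [];
  for (const component of components) {
    // Queried on its own, each component appears harmless to the model.
    answers.push(await queryModel(component));
  }
  // The attacker recombines the per-component answers locally, reconstructing
  // content the model would have refused if asked for in a single prompt.
  return answers.join("\n");
}
```

Logging each component query alongside the recombined output is the kind of evidence a red-teaming interface could surface for human review.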
Low-Resource Language Translation: jailbreak by translating prompts into low-resource languages (see the sketch after this list)
- The attack involves translating unsafe English input prompts into low-resource natural languages using Google Translate.
- Low-resource languages are those with limited training data, like Zulu.
- The translated prompts are sent to GPT-4, which then responds unsafely instead of refusing.
- The attack exploits uneven multilingual training of GPT-4's safety measures.
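A minimal sketch of this pipeline is below; the `translate` helper (e.g. a wrapper around a translation API such as Google Translate) and the `queryModel` function are passed in because the concrete clients are implementation details assumed here.

```ts
// Sketch of the low-resource translation attack pipeline. The translate and
// queryModel helpers are injected, since the concrete translation client and
// model endpoint are assumptions, not part of the original write-up.
type Translate = (text: string, targetLang: string) => Promise<string>;
type QueryModel = (prompt: string) => Promise<string>;

async function runTranslationProbe(
  unsafePrompt: string,
  translate: Translate,
  queryModel: QueryModel,
  lowResourceLang = "zu" // Zulu, an example of a low-resource language
): Promise<{ translated: string; response: string; backTranslated: string }> {
  // 1. Translate the unsafe English prompt into the low-resource language.
  const translated = await translate(unsafePrompt, lowResourceLang);
  // 2. Send the translated prompt to the target model (e.g. GPT-4).
  const response = await queryModel(translated);
  // 3. Translate the response back to English to judge whether the model's
  //    safety refusal was bypassed.
  const backTranslated = await translate(response, "en");
  return { translated, response, backTranslated };
}
```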