
FUZZY Documentation

Welcome to our wiki! Select a section below to learn more:

  • PROVIDERS
    Learn more about which providers we support and how to use them.
  • MODELS
    Learn more about the supported models.
  • ATTACKS
    See what we've already implemented and how you can use it.
  • CLASSIFIERS
    Classifiers evaluate output. We've implemented a few you can use.
  • MUTATORS
    Mutators alter textual input and can serve as a 'gatekeeper' to LLMs.
  • EXTENSIBILITY
    Want to implement your own? Read here on how to extend FUZZY's functionality.

Datasets

We've included a few datasets you can use; they can be found under the resources/ folder.
Note: Some of the prompts may be grammatically incorrect; this is intentional, as it appears to be more effective against the models.

| File name | Description |
| --- | --- |
| pandoras_prompts.txt | Harmful prompts |
| adv_prompts.txt | Harmful prompts |
| benign_prompts.txt | Regular (benign) prompts |
| history_prompts.txt | Harmful prompts phrased as in the "Back To The Past" attack |
| harmful_behaviors.csv | Harmful prompts |
| adv_suffixes.txt | Random prompt suffixes |
| alpaca_data_instructions.json | Alpaca benign queries dataset |
| taxonomy_gpt35_harmful_behaviors_first26.json | Persuasive prompts |
| finetuned_summarizer_train_dataset.jsonl | Dataset used to train a GPT fine-tuned summarizer (see paper, page 20) |
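
If you want to inspect or reuse one of these files outside of FUZZY, the snippet below is a minimal sketch for loading a plain-text prompt list. The load_prompts helper is hypothetical (it is not part of FUZZY's API), and it assumes one prompt per line, which applies to the .txt datasets but not to the CSV/JSON ones.

from pathlib import Path

def load_prompts(filename: str, resources_dir: str = "resources") -> list[str]:
    """Return the non-empty lines of a plain-text prompt file under resources/."""
    path = Path(resources_dir) / filename
    # Each line in the .txt datasets is treated as a single prompt.
    return [line.strip()
            for line in path.read_text(encoding="utf-8").splitlines()
            if line.strip()]

prompts = load_prompts("benign_prompts.txt")
print(f"Loaded {len(prompts)} prompts")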

Persisting Your Settings

To save your configuration, you can create a JSON-formatted config file where the keys correspond to the long-form command-line flags. For example, see config_example.json:

{
  "model": [
    "ollama/mistral"
  ],
  "attack_modes": [
    "def",
    "art"
  ],
  "classifier": [
    "har"
  ],
  "extra": [
    "blacklisted_words=acid"
  ]
}

Once you've customized the configuration to your needs, apply these settings by running the following command:

python run.py -C config_example.json -t "Harmful_Prompt"
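
Because the JSON keys map onto the long-form flags, the same run can also be expressed directly on the command line. The invocation below is only an illustration of that mapping, not confirmed syntax; in particular, repeating a flag for each list entry is an assumption:

python run.py --model ollama/mistral --attack_modes def --attack_modes art --classifier har --extra blacklisted_words=acid -t "Harmful_Prompt"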