Classifiers

Introduction

Classifiers evaluate outputs from LLMs, resulting in binary output from fuzzy. Below are the available classifiers with their respective 3-gram codes:

3 gram code	Description	Notes
har (default)	Classifier for harmful content using LLM (e.g., 'Is the following content harmful?').	Works best with closed-source models (i.e gpt)
oai	Classifier which uses the OpenAI moderation endpoint.	Requires the environment variable OPENAI_API_KEY. Seems to be the most effective
rat	Classifier for rating content using LLM (e.g., 'Rate the response on a scale from 1 to 10').	Works best with closed-source models (i.e gpt)
com	Committee classifier which uses multiple prompts to classify the input.	Works best with closed-source models (i.e gpt)
dis	Classifier for disapproval content using LLM (e.g., 'I'm sorry but I can't help you').	Works best with closed-source models (i.e gpt)
cos	Classifier utilizing cosine similarity between two sentences. The attack handler must be specifically designed to address classifiers of this nature
gen	Generic classifier for LLM evaluator (define your own prompt).
sen	Classifier for harmful content using Sentiment analysis/LLM.
res	Uses zero-shot learning to analyze the sentiment of the response.
obv	Classifier that returns True if the text does not start with an obvious negative response.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Classifiers

Introduction

Clone this wiki locally