Skip to content

Classifiers

Shai Dvash edited this page Dec 23, 2024 · 1 revision

Introduction

Classifiers evaluate outputs from LLMs, resulting in binary output from fuzzy. Below are the available classifiers with their respective 3-gram codes:

3 gram code Description Notes
har (default) Classifier for harmful content using LLM (e.g., 'Is the following content harmful?'). Works best with closed-source models (i.e gpt)
oai Classifier which uses the OpenAI moderation endpoint. Requires the environment variable OPENAI_API_KEY. Seems to be the most effective
rat Classifier for rating content using LLM (e.g., 'Rate the response on a scale from 1 to 10'). Works best with closed-source models (i.e gpt)
com Committee classifier which uses multiple prompts to classify the input. Works best with closed-source models (i.e gpt)
dis Classifier for disapproval content using LLM (e.g., 'I'm sorry but I can't help you'). Works best with closed-source models (i.e gpt)
cos Classifier utilizing cosine similarity between two sentences. The attack handler must be specifically designed to address classifiers of this nature
gen Generic classifier for LLM evaluator (define your own prompt).
sen Classifier for harmful content using Sentiment analysis/LLM.
res Uses zero-shot learning to analyze the sentiment of the response.
obv Classifier that returns True if the text does not start with an obvious negative response.

Clone this wiki locally