-
Notifications
You must be signed in to change notification settings - Fork 11
Classifiers
Shai Dvash edited this page Dec 23, 2024
·
1 revision
Classifiers evaluate outputs from LLMs, resulting in binary output from fuzzy. Below are the available classifiers with their respective 3-gram codes:
3 gram code | Description | Notes |
---|---|---|
har (default) | Classifier for harmful content using LLM (e.g., 'Is the following content harmful?'). | Works best with closed-source models (i.e gpt) |
oai | Classifier which uses the OpenAI moderation endpoint. | Requires the environment variable OPENAI_API_KEY. Seems to be the most effective |
rat | Classifier for rating content using LLM (e.g., 'Rate the response on a scale from 1 to 10'). | Works best with closed-source models (i.e gpt) |
com | Committee classifier which uses multiple prompts to classify the input. | Works best with closed-source models (i.e gpt) |
dis | Classifier for disapproval content using LLM (e.g., 'I'm sorry but I can't help you'). | Works best with closed-source models (i.e gpt) |
cos | Classifier utilizing cosine similarity between two sentences. The attack handler must be specifically designed to address classifiers of this nature | |
gen | Generic classifier for LLM evaluator (define your own prompt). | |
sen | Classifier for harmful content using Sentiment analysis/LLM. | |
res | Uses zero-shot learning to analyze the sentiment of the response. | |
obv | Classifier that returns True if the text does not start with an obvious negative response. |