NLP-ADBench is a benchmark for anomaly detection (AD) in natural language processing (NLP). It provides a comprehensive evaluation of anomaly detection algorithms across a wide range of NLP datasets, using a two-step approach: text is first converted into numerical embeddings, and a detection algorithm is then applied to those embeddings.
The datasets required for this project can be downloaded from the following Hugging Face links:

- NLPAD Datasets: the datasets used in the NLP-ADBench paper. You can download them from:
- Pre-Extracted Embeddings: we evaluate 8 two-step NLP-AD algorithms that rely on embeddings generated by models such as BERT (`bert-base-uncased`) and OpenAI's `text-embedding-3-large`. These algorithms operate on structured numerical data and cannot process raw text directly, so the text must first be transformed into numerical embeddings. We have already extracted these embeddings for your convenience; if you want to use them directly, you can download them from:
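The two-step paradigm above can be sketched in a few lines. This is an illustrative stand-in, not NLP-ADBench's actual code: random vectors substitute for the pre-extracted BERT/OpenAI embeddings, and a simple k-nearest-neighbor distance score substitutes for the 8 benchmarked detectors.

```python
import math
import random

# Step 1 (stubbed): in NLP-ADBench this step is done by BERT or OpenAI
# embedding models; random Gaussian vectors stand in for text embeddings here.
random.seed(0)

def fake_embed(n, dim=8):
    return [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]

train = fake_embed(50)              # embeddings of "normal" documents
test = fake_embed(5) + [[5.0] * 8]  # last vector is an injected outlier

# Step 2: a classical detector applied to the numeric embeddings.
# Mean distance to the k nearest training points is a minimal example of
# the kind of detector the benchmark evaluates (LOF, iForest, etc.).
def knn_score(x, refs, k=5):
    dists = sorted(math.dist(x, r) for r in refs)
    return sum(dists[:k]) / k

scores = [knn_score(x, train) for x in test]
# Higher score means more anomalous; the injected outlier ranks highest.
```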
Follow these steps to set up the development environment using the provided Conda environment file:
- Install Anaconda or Miniconda: download and install Anaconda or Miniconda from here.
- Create the environment: in a terminal, navigate to the directory containing the `environment.yml` file and run:

  ```
  conda env create -f environment.yml
  ```

- Activate the environment: activate the newly created environment with:

  ```
  conda activate nlpad
  ```
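As a small safeguard, a benchmark script could check that the `nlpad` environment is actually active before running. This helper is hypothetical (not part of NLP-ADBench); it relies on the `CONDA_DEFAULT_ENV` variable that conda sets on activation.

```python
import os

def active_conda_env(environ=os.environ):
    """Return the name of the active conda environment, or None."""
    # conda sets CONDA_DEFAULT_ENV when an environment is activated
    return environ.get("CONDA_DEFAULT_ENV")

if active_conda_env() != "nlpad":
    print("Warning: activate the environment first: conda activate nlpad")
```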
Get Pre-Extracted Embeddings

Download the pre-extracted embeddings from the Hugging Face link above and place all of the downloaded embedding files in the `./feature` folder inside the `./benchmark` directory of this project.
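A script consuming the embeddings needs to locate the right files for a given backbone. The sketch below is an assumption about layout, not NLP-ADBench's actual convention: it supposes the files in `./benchmark/feature` are `.npy` files prefixed with the backbone name, and shows how they could be discovered.

```python
from pathlib import Path
import tempfile

def find_embedding_files(feature_dir, backbone):
    """List embedding files for a backbone, assuming '<backbone>_*.npy' names."""
    return sorted(p.name for p in Path(feature_dir).glob(f"{backbone}_*.npy"))

# Demo against a throwaway directory standing in for ./benchmark/feature.
with tempfile.TemporaryDirectory() as d:
    for name in ("bert_agnews.npy", "bert_sms.npy", "gpt_agnews.npy"):
        (Path(d) / name).touch()
    print(find_embedding_files(d, "bert"))  # → ['bert_agnews.npy', 'bert_sms.npy']
```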
Run the following commands from the `./benchmark` directory of the project:

- To run a benchmark on data embedded with BERT's `bert-base-uncased` model:

  ```
  python [your_script_name].py bert
  ```

- To run a benchmark on data embedded with OpenAI's `text-embedding-3-large` model:

  ```
  python [your_script_name].py gpt
  ```
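Internally, a benchmark entry point only needs to map the `bert`/`gpt` argument to the corresponding embedding model. The sketch below is hypothetical (the real scripts in `./benchmark` may differ); only the two argument values and the two model names come from this README.

```python
import sys

# The two embedding sources documented in this README.
EMBEDDINGS = {
    "bert": "bert-base-uncased",
    "gpt": "text-embedding-3-large",
}

def resolve_embedding(arg):
    """Map a CLI argument ('bert' or 'gpt') to its embedding model name."""
    try:
        return EMBEDDINGS[arg]
    except KeyError:
        raise SystemExit(f"usage: python [your_script_name].py {{bert|gpt}} (got {arg!r})")

if __name__ == "__main__" and len(sys.argv) > 1:
    print(f"Running benchmark on {resolve_embedding(sys.argv[1])} embeddings")
```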