This project implements a hybrid log classification system that combines three complementary approaches to handle log patterns of varying complexity. Together, these methods cover predictable, complex, and poorly labeled patterns.
- Regular Expression (Regex):
  - Handles the simplest and most predictable patterns.
  - Useful for patterns that are easily captured by predefined rules.
- Sentence Transformer + Logistic Regression:
  - Handles complex patterns when sufficient training data is available.
  - Uses embeddings generated by Sentence Transformers, with Logistic Regression as the classification layer.
- LLM (Large Language Model):
  - Handles complex patterns when sufficient labeled training data is not available.
  - Serves as a fallback or complement to the other methods.
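
The three methods above can be chained so that each message is handled by the cheapest method able to label it. The sketch below is one possible arrangement, not necessarily the project's actual routing: the regex rules, the labels, the `min_confidence` threshold, and the two classifier callables are all hypothetical placeholders.

```python
import re
from typing import Callable, Optional, Tuple

# Hypothetical rules: these patterns and labels are illustrative only.
REGEX_RULES = [
    (re.compile(r"User \w+ logged (in|out)", re.IGNORECASE), "User Action"),
    (re.compile(r"Backup (started|completed|ended)", re.IGNORECASE), "System Notification"),
]


def classify_with_regex(message: str) -> Optional[str]:
    """Return the label of the first matching rule, or None if nothing matches."""
    for pattern, label in REGEX_RULES:
        if pattern.search(message):
            return label
    return None


def classify_hybrid(
    message: str,
    embedding_classifier: Callable[[str], Tuple[str, float]],
    llm_classifier: Callable[[str], str],
    min_confidence: float = 0.5,  # assumed threshold; tune on real data
) -> Tuple[str, str]:
    """Return (label, stage), where stage names the method that produced the label."""
    # Stage 1: cheap, deterministic rules for predictable patterns.
    label = classify_with_regex(message)
    if label is not None:
        return label, "regex"
    # Stage 2: trained classifier; accept only confident predictions.
    label, confidence = embedding_classifier(message)
    if confidence >= min_confidence:
        return label, "embedding"
    # Stage 3: fall back to the LLM for everything else.
    return llm_classifier(message), "llm"


if __name__ == "__main__":
    # Stand-ins for the trained model and the LLM call.
    fake_model = lambda msg: ("Error", 0.9) if "failed" in msg else ("Unknown", 0.2)
    fake_llm = lambda msg: "Unclassified"
    for msg in ["User User123 logged in.", "Disk write failed", "Something odd happened"]:
        print(msg, "->", classify_hybrid(msg, fake_model, fake_llm))
```

Here `embedding_classifier` stands in for a function that embeds the message with a Sentence Transformer and returns the Logistic Regression model's top label and its probability, and `llm_classifier` stands in for an LLM call. Using a confidence threshold as the signal for falling back to the LLM is an assumption; routing by log source or by label coverage would fit the same structure.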
- datasets/: This folder contains resource files such as test CSV files and output files.
- Install Dependencies: Make sure you have Python installed on your system. Install the required Python libraries by running the following command:

      pip install -r requirements.txt
- Set Up Google Colab: If running the notebook in Google Colab, mount your Google Drive to access the dataset:

      from google.colab import drive
      drive.mount('/content/drive')

- Prepare Dataset: Place your synthetic_logs(2).csv dataset in the appropriate directory. Update file paths in the notebook or scripts if necessary.
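
One way to avoid editing file paths by hand is to resolve the dataset location at runtime. The helper below is a minimal sketch: the function name and both default directories are assumptions (a mounted Drive normally appears under `/content/drive/MyDrive`, and `datasets` is used here as a hypothetical local location), so adjust them to wherever the file actually lives.

```python
from pathlib import Path

DATASET_NAME = "synthetic_logs(2).csv"


def resolve_dataset_path(colab_dir="/content/drive/MyDrive", local_dir="datasets"):
    """Return the dataset path under the mounted Drive if present, else the local path."""
    # Both directory defaults are assumptions; change them to match your setup.
    colab_candidate = Path(colab_dir) / DATASET_NAME
    if colab_candidate.exists():
        return colab_candidate
    return Path(local_dir) / DATASET_NAME
```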
Upload a CSV file containing logs for classification. Ensure the file has the following columns:
- source
- log_message
The output will be a CSV file with an additional column target_label, which represents the classified label for each log entry.
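
The input and output contract described above can be sketched with the standard `csv` module. This is an illustration rather than the project's own code: it assumes the two required columns are named `source` and `log_message`, and `classify_csv` and its `classify` callback are hypothetical names.

```python
import csv
from typing import Callable

REQUIRED_COLUMNS = {"source", "log_message"}  # assumed input column names


def classify_csv(input_path: str, output_path: str,
                 classify: Callable[[str, str], str]) -> None:
    """Copy the input CSV to output_path, appending a target_label column."""
    with open(input_path, newline="", encoding="utf-8") as src:
        reader = csv.DictReader(src)
        # Fail early if the expected columns are absent.
        missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
        if missing:
            raise ValueError(f"Input CSV is missing columns: {sorted(missing)}")
        fieldnames = list(reader.fieldnames) + ["target_label"]
        with open(output_path, "w", newline="", encoding="utf-8") as dst:
            writer = csv.DictWriter(dst, fieldnames=fieldnames)
            writer.writeheader()
            for row in reader:
                # classify receives (source, log_message) and returns a label.
                row["target_label"] = classify(row["source"], row["log_message"])
                writer.writerow(row)
```

Any function that takes a source and a log message and returns a label can be passed as `classify`, so the same loop works whether the label comes from regex rules, the trained classifier, or an LLM.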
