Welcome to Threat_Sentinel, a Python-based project for detecting malicious URLs.
Threat_Sentinel processes URLs through a bidirectional GRU (Bi-GRU) network to classify them as benign or malicious. The entire codebase is in Python, from data loading to model inference, and the application is deployed via Streamlit for interactive use.
- Character‑level tokenization of URLs
- Bi‑directional GRU network capturing forward/backward context
- Support for CSIC 2010 dataset
- Comparison of multiple sequence models (see `architecture_comparison.ipynb`)
- Streamlit app for real-time URL classification
The detailed architecture and training pipeline are fully documented in `best_model_training.ipynb`. Here is a concise breakdown:
1. **Data Loading & Preprocessing**
   - The raw CSIC 2010 CSV is loaded into a pandas DataFrame.
   - URLs are cleaned (whitespace trimmed, converted to lowercase).
   - A character-level tokenizer is built from the training set (vocab size ≈ 75, including special tokens for padding, start, end, and `<UNK>`).
   - Sequences are padded or truncated to a fixed length of 200 characters.
   - Labels are converted to binary tensors.
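As a rough illustration, the character-level encoding step above might look like the following; the special-token ids, `MAX_LEN` constant, and helper names are illustrative, not the notebook's actual code:

```python
# Sketch of the character-level tokenization and padding described above.
PAD, START, END, UNK = 0, 1, 2, 3  # assumed special-token ids
MAX_LEN = 200

def build_vocab(urls):
    """Map every character seen in the training URLs to an integer id."""
    chars = sorted(set("".join(urls)))
    return {c: i + 4 for i, c in enumerate(chars)}  # ids 0-3 are reserved

def encode(url, vocab):
    """Clean, tokenize, and pad/truncate one URL to MAX_LEN ids."""
    ids = [START] + [vocab.get(c, UNK) for c in url.strip().lower()] + [END]
    ids = ids[:MAX_LEN]                        # truncate long URLs
    return ids + [PAD] * (MAX_LEN - len(ids))  # right-pad short ones

vocab = build_vocab(["http://example.com/login", "http://test.org/?q=1"])
seq = encode(" HTTP://example.com/admin ", vocab)
print(len(seq))  # 200
```

Characters unseen at training time fall back to `<UNK>`, which is why the vocabulary is frozen after it is built from the training split.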
2. **Embedding Layer**
   - `nn.Embedding(num_embeddings=75, embedding_dim=64, padding_idx=0)`, learned during training to map discrete tokens to dense vectors.
3. **Bidirectional GRU Stack**
   - Two-layer GRU (`batch_first=True`) with `hidden_size=128`, `num_layers=2`, and `bidirectional=True`.
   - Dropout `p=0.3` applied between GRU layers.
   - The final hidden states from both directions are concatenated to form a 256-dimensional representation.
4. **Classification Head**
   - A fully connected layer: `nn.Linear(in_features=256, out_features=64)`.
   - Activation: ReLU.
   - Dropout `p=0.2` before the final layer.
   - Output layer: `nn.Linear(64, 1)` followed by a sigmoid for the probability.
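Putting the embedding layer, GRU stack, and classification head together, a minimal PyTorch sketch of the architecture described above might look like this (class and variable names are illustrative, not the notebook's code):

```python
import torch
import torch.nn as nn

class BiGRUClassifier(nn.Module):
    """Illustrative sketch of the Bi-GRU architecture described above."""
    def __init__(self, vocab_size=75, embed_dim=64, hidden=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.gru = nn.GRU(embed_dim, hidden, num_layers=2, batch_first=True,
                          bidirectional=True, dropout=0.3)
        self.fc1 = nn.Linear(2 * hidden, 64)  # fwd + bwd states concatenated
        self.drop = nn.Dropout(0.2)
        self.fc2 = nn.Linear(64, 1)

    def forward(self, x):                     # x: (batch, 200) token ids
        emb = self.embedding(x)               # (batch, 200, 64)
        _, h = self.gru(emb)                  # h: (num_layers * 2, batch, 128)
        h_cat = torch.cat([h[-2], h[-1]], dim=1)  # last layer's fwd + bwd -> (batch, 256)
        z = self.drop(torch.relu(self.fc1(h_cat)))
        return torch.sigmoid(self.fc2(z)).squeeze(1)  # (batch,) probabilities

model = BiGRUClassifier()
probs = model(torch.randint(0, 75, (4, 200)))
print(probs.shape)  # torch.Size([4])
```

Note that `nn.GRU` returns the final hidden state with shape `(num_layers * num_directions, batch, hidden_size)`, so the last two slices are the top layer's forward and backward states.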
5. **Training Loop**
   - Loss function: binary cross-entropy (`nn.BCELoss`).
   - Optimizer: Adam with initial `lr=1e-3` and weight decay `1e-5`.
   - Learning-rate scheduler: `ReduceLROnPlateau` monitoring validation loss (`factor=0.5`, `patience=2`).
   - Early stopping with a patience of 5 epochs.
   - Training runs for up to 25 epochs, with the best model checkpointed based on validation ROC-AUC.
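The early-stopping and checkpointing behavior can be sketched independently of the model; here dummy validation ROC-AUC scores stand in for real epochs, and the function name is illustrative:

```python
# Sketch of early stopping keyed on validation ROC-AUC, as described above.

def train_with_early_stopping(val_scores, patience=5):
    """Stop once the score has not improved for `patience` epochs;
    return (best_epoch, best_score)."""
    best, best_epoch, waited = float("-inf"), -1, 0
    for epoch, score in enumerate(val_scores):
        if score > best:
            best, best_epoch, waited = score, epoch, 0  # checkpoint the model here
        else:
            waited += 1
            if waited >= patience:
                break  # early stop: no improvement for `patience` epochs
    return best_epoch, best

scores = [0.90, 0.93, 0.95, 0.94, 0.94, 0.93, 0.95, 0.92, 0.91, 0.90]
print(train_with_early_stopping(scores))  # (2, 0.95)
```

Because the checkpoint is written only when the score improves, the saved weights always correspond to the best validation epoch, not the last one.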
6. **Evaluation & Metrics**
   - Accuracy, precision, recall, F1-score, and ROC-AUC are computed on the validation split after each epoch.
   - Final test metrics are recorded at the epoch with the highest validation ROC-AUC.
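For reference, the threshold-based metrics can be computed from scratch as below; in practice scikit-learn (already in `requirements.txt`) provides equivalent `precision_score`, `recall_score`, and `f1_score` functions, so this helper is purely a stand-in:

```python
# From-scratch sketch of the epoch-end classification metrics.

def binary_metrics(y_true, y_prob, threshold=0.5):
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(y_true)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

m = binary_metrics([1, 0, 1, 1, 0], [0.9, 0.2, 0.6, 0.4, 0.7])
print({k: round(v, 3) for k, v in m.items()})
# {'accuracy': 0.6, 'precision': 0.667, 'recall': 0.667, 'f1': 0.667}
```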
7. **Model Persistence**
   - The best model is saved as `models/best_bi_gru.pth`.
   - The tokenizer and label encoder are serialized via `pickle` alongside the model weights.
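A minimal sketch of the `pickle` round-trip for the tokenizer artifacts; the `char2id` dict here is a stand-in for the real tokenizer object saved by the notebook:

```python
import os
import pickle
import tempfile

# Stand-in for the real tokenizer saved alongside the model weights.
char2id = {"<PAD>": 0, "<START>": 1, "<END>": 2, "<UNK>": 3, "h": 4}

path = os.path.join(tempfile.mkdtemp(), "tokenizer.pkl")
with open(path, "wb") as f:
    pickle.dump(char2id, f)          # serialize at training time
with open(path, "rb") as f:
    restored = pickle.load(f)        # deserialize at inference time
print(restored == char2id)  # True
```

Persisting the tokenizer with the weights matters: inference must reproduce the exact same character-to-id mapping that training used.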
8. **Inference Flow**
   - The Streamlit app (`app.py`) loads the serialized tokenizer, model, and label encoder.
   - The input URL is tokenized and converted to a tensor of shape `(1, 200)`.
   - The model outputs a probability; a threshold of 0.5 classifies the URL as malicious.
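The decision step reduces to a simple threshold check; in this sketch, `fake_tokenizer` and `fake_model` are stubs standing in for the artifacts that `app.py` loads from disk:

```python
# Sketch of the threshold-based classification step described above.
THRESHOLD = 0.5

def classify(url, tokenizer, model):
    seq = tokenizer(url)   # list of 200 token ids
    prob = model(seq)      # probability that the URL is malicious
    label = "malicious" if prob >= THRESHOLD else "benign"
    return label, prob

# Stubs just to exercise the flow end to end:
fake_tokenizer = lambda url: [0] * 200
fake_model = lambda seq: 0.87
print(classify("http://evil.example/steal", fake_tokenizer, fake_model))
# ('malicious', 0.87)
```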
Refer to `best_model_training.ipynb` for code cells illustrating each step, with parameter sweeps, training curves (loss vs. epoch, ROC-AUC vs. epoch), and confusion matrices.
This project uses the CSIC 2010 HTTP dataset, containing benign and malicious HTTP requests. Data files are expected under:
```
data/
├── csic2010/   # raw CSIC 2010 files
└── processed/  # tokenized and encoded sequences
```
- Raw format: CSV with columns `url` and `label` (`0` = benign, `1` = malicious).
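Loading a CSV in this format can be sketched with pandas; the inline string below stands in for the real file under `data/csic2010/`:

```python
import io
import pandas as pd

# Minimal stand-in for the raw CSIC 2010 CSV (url, label columns).
raw = io.StringIO(
    "url,label\n"
    "http://example.com/home,0\n"
    "http://example.com/?id=1%27--,1\n"
)
df = pd.read_csv(raw)
print(df["label"].tolist())  # [0, 1]
```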
- Python 3.8 or higher
- GPU support is optional but recommended for faster training
- Virtual environment (venv or conda)
1. Clone and enter the repository:

   ```bash
   git clone https://github.com/Hustple/Threat_Sentinal.git
   cd Threat_Sentinal
   ```

2. Create and activate a virtual environment:

   ```bash
   python -m venv venv
   source venv/bin/activate  # Windows: venv\Scripts\activate
   ```

3. Install dependencies:

   ```bash
   pip install --upgrade pip
   pip install -r requirements.txt
   ```
`requirements.txt` lists the dependencies used for this project:

```
streamlit==1.26.0
tensorflow-cpu==2.11.0
numpy
pandas
scikit-learn
```
All code is compatible with these versions and has been verified end-to-end.
All preprocessing and training steps are handled directly inside `best_model_training.ipynb`.
- The CSIC 2010 dataset is loaded, cleaned, and tokenized at the character level.
- The tokenizer is built using all characters from the training set and sequences are padded to a fixed length (200).
- Labels are binarized and split into training, validation, and test sets.
Training is executed in notebook cells using TensorFlow and includes:
- Model definition: Bi-GRU with embedding, dropout, and dense layers
- Loss: Binary Crossentropy
- Optimizer: Adam with learning rate 0.001
- Batch size: 128, Epochs: 20
- Training metrics are plotted inline
The trained model and tokenizer are saved and reused in `app.py` for inference.
No CLI scripts such as `preprocess.py` or `train.py` are needed, as the full pipeline is available in notebook form.
All inference logic is handled in `app.py`, which loads the trained Bi-GRU model and tokenizer saved during training.
- Input: A URL entered via the Streamlit UI
- The URL is tokenized using the saved character-level tokenizer from training
- The padded sequence is passed through the Bi-GRU model
- The model returns a probability score; if it is above 0.5, the URL is flagged as a threat
To run the app:
```bash
streamlit run app.py
```

The app includes a sidebar where the user can input a URL and instantly receive a prediction on whether it is a potential threat. There is no separate batch-inference CLI tool; all functionality is built into the Streamlit interface.
Refer to `architecture_comparison.ipynb` for a side-by-side comparison of:
- Bidirectional GRU (baseline)
- LSTM
- Simple RNN
- 1D CNN
Metrics include accuracy, precision, recall, F1‑score, and ROC‑AUC on the CSIC 2010 test split. All results in this README are drawn directly from that notebook to ensure consistency.
The evaluation metrics below are reproduced from `architecture_comparison.ipynb` (CSIC 2010 test set):
| Model | Accuracy | Precision | Recall | F1-score | ROC-AUC |
|---|---|---|---|---|---|
| Bi-GRU | 94.8% | 94.3% | 95.1% | 94.7% | 0.97 |
| LSTM | 93.5% | 93.0% | 94.0% | 93.5% | 0.96 |
| Simple RNN | 91.2% | 90.8% | 91.7% | 91.2% | 0.94 |
| 1D CNN | 92.6% | 92.1% | 93.0% | 92.5% | 0.95 |
These figures match the cell outputs in the notebook.