Experiments are based on the light version of the IoT-23 dataset [1].
No | Name | Version | Description
---|---|---|---
1 | Python | 3.8.8 | Programming Language |
2 | scikit-learn | 0.24.1 | Tools for Machine Learning in Python |
3 | NumPy | 1.19.5 | Tools for Scientific Computing in Python
4 | pandas | 1.2.2 | Tools for Data Analysis & Data Manipulation in Python |
5 | matplotlib | 3.3.4 | Visualization with Python |
6 | seaborn | 0.11.1 | Statistical data visualization |
7 | psutil | 5.8.0 | Cross-platform library for retrieving information on running processes and system utilization (CPU, memory, disks, network, sensors) in Python |
8 | scikit-plot | 0.3.7 | Library for visualizations |
9 | pickle | - | Python object serialization for model serialization |
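The table lists pickle for model serialization. A minimal sketch of how a trained scikit-learn model might be saved and restored (the model, data, and file name here are illustrative stand-ins, not the repo's actual experiment code):

```python
import pickle
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Train a small placeholder model (stand-in for the experiment models)
X, y = make_classification(n_samples=100, n_features=4, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Serialize the trained model to disk with pickle
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

# Load it back later and verify it predicts identically
with open("model.pkl", "rb") as f:
    restored = pickle.load(f)
print((restored.predict(X) == model.predict(X)).all())
```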
- Download the lighter version of IoT-23 (archive size: 8.8 GB)
The lighter version contains only the labeled flows, without the pcap files
- Extract the archive (extracted size: approx. 44 GB)
- Clone this repo
- Install missing libraries
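The missing libraries can be installed with pip, pinned to the versions from the table above (a setup sketch; adjust versions if your Python environment requires it):

```
pip install scikit-learn==0.24.1 numpy==1.19.5 pandas==1.2.2 \
    matplotlib==3.3.4 seaborn==0.11.1 psutil==5.8.0 scikit-plot==0.3.7
```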
- Open config.py and configure the required directories
- iot23_scenarios_dir should point to the home folder, where iot23 scenarios are located
- iot23_attacks_dir will be used to store files for each attack type from the scenarios files
- iot23_data_dir will be used to store files with data, extracted from attack files
- iot23_experiments_dir will be used to store experiment files, including trained models and results (Excel files & Charts)
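The four directories above might look like this in config.py (the variable names come from the list above; the paths are illustrative placeholders, adjust them for your machine):

```python
# config.py (illustrative values; adjust the paths for your machine)
iot23_scenarios_dir = "/data/iot23/scenarios"      # downloaded IoT-23 scenario folders
iot23_attacks_dir = "/data/iot23/attacks"          # per-attack-type files extracted from the scenarios
iot23_data_dir = "/data/iot23/data"                # data extracted from the attack files
iot23_experiments_dir = "/data/iot23/experiments"  # trained models, Excel results & charts
```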
- Check the configuration by running run_step00_configuration_check.py
Make sure the output message says you may continue to the next step. If not, check your configuration and fix the errors.
Run data extraction via run_step01_extract_data_from_scenarios.py
Even though there are multiple scenarios, each file still contains mixed attack and benign traffic. For this reason, entries of the same type are extracted into separate files. The output files are stored in iot23_attacks_dir.
⚠️ This step takes about 2h to complete.
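The extraction step can be sketched as routing each labeled flow line to a per-label output file. This is an assumption-laden illustration, not the repo's actual code: the function name is hypothetical, and the label is assumed to sit in the last whitespace-separated field of each line (as in Zeek's conn.log.labeled files):

```python
import os

def split_by_label(scenario_file, out_dir, label_field=-1):
    """Route each labeled flow line to a file named after its label (sketch)."""
    os.makedirs(out_dir, exist_ok=True)
    handles = {}
    try:
        with open(scenario_file) as src:
            for line in src:
                if line.startswith("#") or not line.strip():
                    continue  # skip Zeek header/comment lines and blanks
                label = line.split()[label_field]
                if label not in handles:
                    handles[label] = open(os.path.join(out_dir, label + ".txt"), "a")
                handles[label].write(line)
    finally:
        for h in handles.values():
            h.close()
```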
Run content shuffling via run_step01_shuffle_file_content.py
This step provides more reliable data samples. Larger files are split into partitions of 1 GB, the content of each file's partitions is shuffled, and the partitions are then merged back into a single file that replaces the original one.
⚠️ This step takes about 2.5 - 3h to complete.
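The partition-shuffle-merge idea above can be sketched as follows (the function name is hypothetical, and partitions are counted in lines rather than the 1 GB byte size the step actually uses; note that shuffling within partitions only approximates a full global shuffle):

```python
import os
import random

def shuffle_file(path, max_lines_per_partition=1_000_000):
    """Shuffle a large file: shuffle fixed-size partitions, then merge them back."""
    partitions = []
    with open(path) as src:
        while True:
            # Read the next partition's worth of lines
            chunk = [line for _, line in zip(range(max_lines_per_partition), src)]
            if not chunk:
                break
            random.shuffle(chunk)
            part_path = f"{path}.part{len(partitions)}"
            with open(part_path, "w") as part:
                part.writelines(chunk)
            partitions.append(part_path)
    # Merge the shuffled partitions back, replacing the original file
    with open(path, "w") as dst:
        for part_path in partitions:
            with open(part_path) as part:
                dst.writelines(part)
            os.remove(part_path)
```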
Use this option to check that everything is OK. It uses only 10_000 records per file, so the whole process runs in a couple of minutes if the data is already prepared.
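The quick-check record cap can be reproduced with pandas' `nrows` parameter (the file and column names below are toy placeholders, not the repo's actual data files):

```python
import pandas as pd

# Toy stand-in for a prepared attack file (real files hold millions of rows)
pd.DataFrame({"duration": range(50_000)}).to_csv("data.csv", index=False)

# Quick-check mode: cap reads at 10_000 records per file for a fast sanity run
df = pd.read_csv("data.csv", nrows=10_000)
print(df.shape)  # (10000, 1)
```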
⚠️ ⚠️ ⚠️ This step takes about 24h to complete! Data samples for training and testing consist of more than 20M records.
TODO
TODO
[1]: Stratosphere Laboratory. "A labeled dataset with malicious and benign IoT network traffic." Agustin Parmisano, Sebastian Garcia, Maria Jose Erquiaga. January 22nd. Online: https://www.stratosphereips.org/datasets-iot23