In recent years, the malware industry has experienced rapid growth, with syndicates investing heavily in evading traditional protection measures. This necessitates the development of robust software to detect and terminate malware attacks. The primary goal of this project is to identify whether a given file/software is malware by leveraging machine learning techniques.
- Minimize multi-class error.
- Provide multi-class probability estimates.
- Ensure that malware detection finishes within a few seconds to a minute, so it does not block the user's computer.
The dataset used for this project can be found on Kaggle. It consists of two types of files for each malware sample:
- `.asm` file: contains the assembly code of the sample.
- `.bytes` file: contains the raw hexadecimal representation of the file's binary content, without the PE header.
The train dataset is divided into two subdirectories:
- 50 GB of data in `.bytes` files.
- 150 GB of data in `.asm` files.
There are 21,736 files in total, with 10,868 files in each format. The dataset includes 9 malware classes:
- Ramnit 🐞
- Lollipop 🍭
- Kelihos_ver3 🦠
- Vundo 🦠
- Simda 🦠
- Tracur 🦠
- Kelihos_ver1 🦠
- Obfuscator.ACY 🦠
- Gatak 🦠
The performance of the malware detection model will be evaluated using the following metrics:
- Multi-class log-loss
- Confusion matrix
- Data Preprocessing 📊: Convert `.bytes` files into a bag-of-words representation using the hexadecimal codes. Extract features from `.asm` files based on count vectors for prefixes, opcodes, keywords, and registers. Include file size as a feature in the analysis.
- Univariate Analysis 📊: Perform univariate analysis using box plots to identify outliers, visualize the distribution of individual features, and gain insight into their statistical properties.
- Multivariate Analysis using t-SNE 📊: Visualize the high-dimensional feature space using t-SNE to gain further insight into the data distribution and identify potential clusters or patterns.
- Model Training and Evaluation 🧠: Train machine learning models such as XGBoost, Decision Tree, K-Nearest Neighbors, Logistic Regression, and Random Forest on the preprocessed data. Use evaluation metrics such as multi-class log-loss and the confusion matrix to assess model performance.
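The t-SNE step above can be sketched with scikit-learn; the random matrix stands in for the real bag-of-words and opcode-count features, and the perplexity value is illustrative:

```python
import numpy as np
from sklearn.manifold import TSNE

# Sketch: project a high-dimensional feature matrix to 2-D with t-SNE.
# Random data stands in for the real malware feature matrix.
rng = np.random.default_rng(0)
X = rng.random((60, 20))  # 60 samples, 20 features

# perplexity must be smaller than the number of samples
emb = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(X)
print(emb.shape)  # one 2-D point per sample
```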
The following machine learning models were used for malware detection:
- K-Nearest Neighbors (KNN)
- Logistic Regression (LR)
- Random Forest (RF)
- XGBoost (XGB)
These models were applied to the `.asm` files and `.bytes` files separately, as well as to the combined data.
- The `.asm` files were preprocessed to extract features based on count vectors for prefixes, opcodes, keywords, and registers.
- The extracted features were used as inputs to train the KNN, LR, RF, and XGB models separately.
- The models were trained using the preprocessed `.asm` file data and evaluated using metrics such as multi-class log-loss and the confusion matrix.
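A minimal sketch of the count-vector idea for `.asm` files, using a small illustrative vocabulary of opcodes and registers (the notebook's actual vocabulary of prefixes, opcodes, keywords, and registers is larger):

```python
# Sketch: count occurrences of a fixed vocabulary of opcodes and registers
# in .asm text. The vocabulary here is an illustrative subset.
OPCODES = ["mov", "push", "pop", "call", "jmp", "add", "sub"]
REGISTERS = ["eax", "ebx", "ecx", "edx", "esi", "edi", "ebp", "esp"]

def asm_count_vector(asm_text: str) -> list[int]:
    # Treat commas as separators so "ebp, esp" splits into two tokens.
    tokens = asm_text.lower().replace(",", " ").split()
    return [tokens.count(t) for t in OPCODES + REGISTERS]

sample = ".text:00401000 push ebp\n.text:00401001 mov ebp, esp"
vec = asm_count_vector(sample)
```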
- The `.bytes` files were converted into a bag-of-words representation using hexadecimal codes.
- The bag-of-words representation was used as input to train the KNN, LR, RF, and XGB models separately.
- As with the `.asm` file analysis, the models were trained using the preprocessed `.bytes` file data and evaluated using the same metrics.
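The bag-of-words conversion can be sketched as a frequency count over the 256 possible byte values plus the `??` marker; the sample line below is illustrative:

```python
from collections import Counter

# Sketch: bag-of-words over hexadecimal byte codes, assuming the .bytes
# line format "ADDRESS B0 B1 ...". Vocabulary: 00..FF plus the '??' marker.
VOCAB = [f"{i:02X}" for i in range(256)] + ["??"]
INDEX = {tok: i for i, tok in enumerate(VOCAB)}

def bytes_bag_of_words(lines) -> list[int]:
    """Return a 257-dimensional count vector for one .bytes file."""
    counts = Counter()
    for line in lines:
        counts.update(tok for tok in line.split()[1:] if tok in INDEX)
    vec = [0] * len(VOCAB)
    for tok, c in counts.items():
        vec[INDEX[tok]] = c
    return vec

vec = bytes_bag_of_words(["00401000 56 8D 44 24 ?? 56"])
```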
- The features extracted from the `.asm` and `.bytes` files were combined to create a consolidated feature set.
- The combined feature set was used as input to train the KNN, LR, RF, and XGB models.
- The models were trained using the combined feature set and evaluated using appropriate evaluation metrics.
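Combining the two feature sets amounts to column-wise concatenation; a sketch with illustrative shapes (the 15 `.asm` features and 257 `.bytes` features here are placeholders for the real dimensions):

```python
import numpy as np

# Sketch: concatenate .asm count vectors and .bytes bag-of-words features
# column-wise into one combined feature matrix. Shapes are illustrative.
asm_features = np.ones((5, 15))     # e.g. opcode/register counts per file
bytes_features = np.ones((5, 257))  # e.g. hex-code bag-of-words per file
combined = np.hstack([asm_features, bytes_features])
print(combined.shape)  # (5, 272)
```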
- Open the Jupyter Notebook file `MicrosoftMalwareDetection.ipynb` in Jupyter Notebook.
- Run the cells in the notebook to preprocess the data, perform feature engineering, and train the KNN, LR, RF, and XGB models on the `.asm` files and `.bytes` files separately, as well as on the combined data.
- Evaluate the performance of each model using metrics such as multi-class log-loss and the confusion matrix.
- Follow the instructions provided within the notebook to generate visualizations, interpret the results, and make data-driven decisions.
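The training-and-evaluation loop in the notebook can be sketched as follows; the synthetic features, random 9-class labels, and Random Forest hyperparameters are illustrative stand-ins for the real data and tuned models:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, log_loss
from sklearn.model_selection import train_test_split

# Sketch: train one of the listed models on stand-in data and score it
# with multi-class log-loss and a confusion matrix. The 9 labels mirror
# the 9 malware families.
rng = np.random.default_rng(0)
X = rng.random((450, 30))
y = rng.integers(0, 9, size=450)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)
loss = log_loss(y_te, proba, labels=clf.classes_)
cm = confusion_matrix(y_te, clf.predict(X_te), labels=clf.classes_)
print(f"log-loss: {loss:.3f}, confusion matrix shape: {cm.shape}")
```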
| Model | Log Loss | Misclassified % |
|---|---|---|
| Random Model | 2.45 | 88% |
| KNN | 0.24 | 4.5% |
| RF | 0.085 | 2.02% |
| XGB | 0.078 | 1.24% |
| Model | Log Loss | Misclassified % |
|---|---|---|
| KNN | 0.089 | 2.02% |
| LR | 0.415 | 9.61% |
| RF | 0.057 | 1.15% |
| XGB | 0.048 | 0.87% |
| Model | Log Loss | Misclassified % |
|---|---|---|
| RF | 0.04 | <1% |
| XGB | 0.031 | <1% |
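As a sanity check on the tables above: a model that assigns a uniform probability of 1/9 to each of the 9 classes scores a log-loss of -ln(1/9) ≈ 2.197, so the reported random-model baseline of 2.45 (random, rather than uniform, probabilities) is in the expected range:

```python
import math

# Log-loss of a model that predicts a uniform 1/9 for every class.
n_classes = 9
uniform_log_loss = -math.log(1.0 / n_classes)
print(round(uniform_log_loss, 3))  # 2.197
```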
If you encounter any issues or bugs while using this project, please let us know by creating an issue. Your feedback and bug reports are valuable, and your contributions to improving the project are appreciated.
We will do our best to address and resolve issues promptly. Thank you!
If you find this project useful or helpful, please consider starring ⭐️ the repository. Your support is greatly appreciated!