In recent years, the malware industry has experienced rapid growth, with syndicates investing heavily in evading traditional protection measures. This necessitates the development of robust software to detect and terminate malware attacks. The primary goal of this project is to identify whether a given file/software is malware by leveraging machine learning techniques.
- Minimize multi-class error.
- Provide multi-class probability estimates.
- Ensure that malware detection finishes within a few seconds to a minute, so it does not block the user's computer.
The dataset used for this project can be found on Kaggle. It consists of two types of files for each malware sample:
- `.asm` file: contains the assembly code of the sample.
- `.bytes` file: contains the raw hexadecimal representation of the file's binary content, without the PE header.
The train dataset is divided into two subdirectories:
- 50 GB of data in `.bytes` files.
- 150 GB of data in `.asm` files.
There are 21,736 files in total, with 10,868 files in each format. The dataset includes 9 malware classes:
- Ramnit 🐞
- Lollipop 🍭
- Kelihos_ver3 🦠
- Vundo 🦠
- Simda 🦠
- Tracur 🦠
- Kelihos_ver1 🦠
- Obfuscator.ACY 🦠
- Gatak 🦠
The performance of the malware detection model will be evaluated using the following metrics:
- Multi-class log-loss
- Confusion matrix
- Data Preprocessing 📊: Convert `.bytes` files into a bag-of-words representation using the hexadecimal codes. Extract features from `.asm` files based on count vectors for prefixes, opcodes, keywords, and registers. Include file size as a feature in the analysis.
- Univariate Analysis 📊: Perform univariate analysis using box plots to identify outliers, visualize the distribution of individual features, and gain insight into their statistical properties.
- Multivariate Analysis using t-SNE 📊: Visualize the high-dimensional feature space using t-SNE to gain further insight into the data distribution and identify potential clusters or patterns.
- Model Training and Evaluation 🧠: Train machine learning models such as XGBoost, Decision Tree, K-Nearest Neighbors, Logistic Regression, and Random Forest on the preprocessed data. Use evaluation metrics such as multi-class log-loss and the confusion matrix to assess model performance.
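The t-SNE step above can be sketched with scikit-learn; the random matrix stands in for the real bag-of-words and opcode-count features, and the perplexity value is illustrative:

```python
import numpy as np
from sklearn.manifold import TSNE

# Sketch: project a high-dimensional feature matrix to 2-D with t-SNE.
# Random data stands in for the real malware feature matrix.
rng = np.random.default_rng(0)
X = rng.random((60, 20))  # 60 samples, 20 features

# perplexity must be smaller than the number of samples
emb = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(X)
print(emb.shape)  # one 2-D point per sample
```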
The following machine learning models were used for malware detection:
- K-Nearest Neighbors (KNN)
- Logistic Regression (LR)
- Random Forest (RF)
- XGBoost (XGB)
These models were applied to the `.asm` files and `.bytes` files separately, as well as to the combined data.
- The `.asm` files were preprocessed to extract features based on count vectors for prefixes, opcodes, keywords, and registers.
- The extracted features were used as inputs to train the KNN, LR, RF, and XGB models separately.
- The models were trained using the preprocessed `.asm` file data and evaluated using metrics such as multi-class log-loss and the confusion matrix.
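A minimal sketch of the count-vector idea for `.asm` files, using a small illustrative vocabulary of opcodes and registers (the notebook's actual vocabulary of prefixes, opcodes, keywords, and registers is larger):

```python
# Sketch: count occurrences of a fixed vocabulary of opcodes and registers
# in .asm text. The vocabulary here is an illustrative subset.
OPCODES = ["mov", "push", "pop", "call", "jmp", "add", "sub"]
REGISTERS = ["eax", "ebx", "ecx", "edx", "esi", "edi", "ebp", "esp"]

def asm_count_vector(asm_text: str) -> list[int]:
    # Treat commas as separators so "ebp, esp" splits into two tokens.
    tokens = asm_text.lower().replace(",", " ").split()
    return [tokens.count(t) for t in OPCODES + REGISTERS]

sample = ".text:00401000 push ebp\n.text:00401001 mov ebp, esp"
vec = asm_count_vector(sample)
```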
- The `.bytes` files were converted into a bag-of-words representation using hexadecimal codes.
- The bag-of-words representation was used as input to train the KNN, LR, RF, and XGB models separately.
- As with the `.asm` file analysis, the models were trained using the preprocessed `.bytes` file data and evaluated using the same metrics.
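The bag-of-words conversion can be sketched as a frequency count over the 256 possible byte values plus the `??` marker; the sample line below is illustrative:

```python
from collections import Counter

# Sketch: bag-of-words over hexadecimal byte codes, assuming the .bytes
# line format "ADDRESS B0 B1 ...". Vocabulary: 00..FF plus the '??' marker.
VOCAB = [f"{i:02X}" for i in range(256)] + ["??"]
INDEX = {tok: i for i, tok in enumerate(VOCAB)}

def bytes_bag_of_words(lines) -> list[int]:
    """Return a 257-dimensional count vector for one .bytes file."""
    counts = Counter()
    for line in lines:
        counts.update(tok for tok in line.split()[1:] if tok in INDEX)
    vec = [0] * len(VOCAB)
    for tok, c in counts.items():
        vec[INDEX[tok]] = c
    return vec

vec = bytes_bag_of_words(["00401000 56 8D 44 24 ?? 56"])
```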
- The features extracted from the `.asm` and `.bytes` files were combined to create a consolidated feature set.
- The combined feature set was used as input to train the KNN, LR, RF, and XGB models.
- The models were trained using the combined feature set and evaluated using appropriate evaluation metrics.
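Combining the two feature sets amounts to column-wise concatenation; a sketch with illustrative shapes (the 15 `.asm` features and 257 `.bytes` features here are placeholders for the real dimensions):

```python
import numpy as np

# Sketch: concatenate .asm count vectors and .bytes bag-of-words features
# column-wise into one combined feature matrix. Shapes are illustrative.
asm_features = np.ones((5, 15))     # e.g. opcode/register counts per file
bytes_features = np.ones((5, 257))  # e.g. hex-code bag-of-words per file
combined = np.hstack([asm_features, bytes_features])
print(combined.shape)  # (5, 272)
```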
- Open the Jupyter Notebook file `MicrosoftMalwareDetection.ipynb` in Jupyter Notebook.
- Run the cells in the notebook to preprocess the data, perform feature engineering, and train the KNN, LR, RF, and XGB models on the `.asm` files and `.bytes` files separately, as well as on the combined data.
- Evaluate the performance of each model using metrics such as multi-class log-loss and the confusion matrix.
- Follow the instructions provided within the notebook to generate visualizations, interpret the results, and make data-driven decisions.
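The training-and-evaluation loop in the notebook can be sketched as follows; the synthetic features, random 9-class labels, and Random Forest hyperparameters are illustrative stand-ins for the real data and tuned models:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, log_loss
from sklearn.model_selection import train_test_split

# Sketch: train one of the listed models on stand-in data and score it
# with multi-class log-loss and a confusion matrix. The 9 labels mirror
# the 9 malware families.
rng = np.random.default_rng(0)
X = rng.random((450, 30))
y = rng.integers(0, 9, size=450)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)
loss = log_loss(y_te, proba, labels=clf.classes_)
cm = confusion_matrix(y_te, clf.predict(X_te), labels=clf.classes_)
print(f"log-loss: {loss:.3f}, confusion matrix shape: {cm.shape}")
```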
| Model | Log Loss | Misclassified % |
|---|---|---|
| Random Model | 2.45 | 88% |
| KNN | 0.24 | 4.5% |
| RF | 0.085 | 2.02% |
| XGB | 0.078 | 1.24% |
| Model | Log Loss | Misclassified % |
|---|---|---|
| KNN | 0.089 | 2.02% |
| LR | 0.415 | 9.61% |
| RF | 0.057 | 1.15% |
| XGB | 0.048 | 0.87% |
| Model | Log Loss | Misclassified % |
|---|---|---|
| RF | 0.04 | <1% |
| XGB | 0.031 | <1% |
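As a sanity check on the tables above: a model that assigns a uniform probability of 1/9 to each of the 9 classes scores a log-loss of -ln(1/9) ≈ 2.197, so the reported random-model baseline of 2.45 (random, rather than uniform, probabilities) is in the expected range:

```python
import math

# Log-loss of a model that predicts a uniform 1/9 for every class.
n_classes = 9
uniform_log_loss = -math.log(1.0 / n_classes)
print(round(uniform_log_loss, 3))  # 2.197
```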
If you encounter any issues or bugs while using this project, please let us know by creating an issue. Your feedback and bug reports are valuable, and your contributions to improving the project are appreciated.
We will do our best to address and resolve issues promptly. Thank you!
If you find this project useful or helpful, please consider starring ⭐️ the repository. Your support is greatly appreciated!