update-ankur/Microsoft-Malware-Detection

🔍 Malware Detection Using Machine Learning

🎯 Problem Statement

In recent years, the malware industry has grown rapidly, with syndicates investing heavily in evading traditional protection measures. This calls for robust software that can detect and stop malware attacks. The primary goal of this project is to use machine learning to detect malware and, for the dataset used here, to assign each file to one of nine malware classes.

🌍 Real-world/Business Objectives and Constraints

  • Minimize multi-class error.
  • Provide multi-class probability estimates.
  • Ensure that malware detection finishes within a few seconds, or a minute at most, so that it does not block the user's computer.

📊 Data Overview

The dataset used for this project can be found on Kaggle. It consists of two types of files for each malware:

  1. .asm file: Contains the disassembled assembly code of the file.
  2. .bytes file: Contains the raw hexadecimal representation of the file's binary content without the PE header.

The train dataset is divided into two subdirectories:

  • 50GB of data in .bytes files.
  • 150GB of data in .asm files.

There are a total of 21,736 files, with 10,868 files in each format. The dataset includes 9 types of malware classes:

  • Ramnit 🐞
  • Lollipop 🍭
  • Kelihos_ver3 🦠
  • Vundo 🦠
  • Simda 🦠
  • Tracur 🦠
  • Kelihos_ver1 🦠
  • Obfuscator.ACY 🦠
  • Gatak 🦠

📈 Performance Metric

The performance of the malware detection model will be evaluated using the following metrics:

  • Multi-class log-loss
  • Confusion matrix
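
Both metrics can be sketched in plain Python. This is a minimal illustration, not the notebook's code: the function names and the toy labels and probabilities are made up for the example. Log-loss averages the negative log of the probability assigned to the true class, and the confusion matrix counts how often each true class is predicted as each other class.

```python
import math

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    y_true: list of integer class indices (0-based).
    probs:  list of per-sample probability lists, one entry per class.
    """
    total = 0.0
    for label, row in zip(y_true, probs):
        # Clip to avoid log(0) when a model is overconfident and wrong.
        p = min(max(row[label], eps), 1 - eps)
        total += -math.log(p)
    return total / len(y_true)

def confusion_matrix(y_true, y_pred, n_classes):
    """matrix[i][j] counts samples of true class i predicted as class j."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

# Toy example with 3 classes (made-up numbers, not project data).
y_true = [0, 1, 2, 1]
probs = [
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
    [0.5, 0.4, 0.1],
]
# Predicted class = index of the highest probability in each row.
y_pred = [row.index(max(row)) for row in probs]
loss = multiclass_log_loss(y_true, probs)
cm = confusion_matrix(y_true, y_pred, n_classes=3)
```

Lower log-loss is better, and a perfect classifier would have all counts on the diagonal of the confusion matrix.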

📚 Methodology

  1. Data Preprocessing 📊: Convert .bytes files into a bag-of-words representation using the hexadecimal codes. Extract features from .asm files based on count vectors for prefixes, opcodes, keywords, and registers. Include file size as a feature in the analysis.

  2. Univariate Analysis 📊: Perform univariate analysis using box plots to identify any outliers, visualize the distribution of individual features, and gain insights into their statistical properties.

  3. Multivariate Analysis using t-SNE 📊: Visualize the high-dimensional feature space using t-SNE to gain further insights into the data distribution and identify potential clusters or patterns.

  4. Model Training and Evaluation 🧠: Train machine learning models such as XGBoost, Decision Tree, K-Nearest Neighbors, Logistic Regression, and Random Forest on the preprocessed data. Use evaluation metrics such as multi-class log-loss and confusion matrix to assess the performance of the models.
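
As an illustration of the univariate step, the numbers a box plot draws can be computed directly. This is a sketch under assumptions: the 1.5 × IQR whisker rule is the common convention but may differ from what a given plotting library uses, and the file sizes below are invented values, not project data.

```python
import statistics

def box_plot_summary(values):
    """Return the quartiles, whisker fences, and outliers a box plot would show."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    # Points beyond 1.5 * IQR from the box are flagged as outliers.
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "outliers": outliers,
    }

# Hypothetical file sizes in MB (made-up values, not project data).
file_sizes_mb = [1, 2, 3, 4, 5, 6, 7, 8, 100]
summary = box_plot_summary(file_sizes_mb)
```

Running the same summary per malware class for a feature such as file size shows whether that feature separates the classes on its own.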

🧠 Machine Learning Models

The following machine learning models were used for malware detection:

  • K-Nearest Neighbors (KNN)
  • Logistic Regression (LR)
  • Random Forest (RF)
  • XGBoost (XGB)

These models were applied to the .asm files and the .bytes files separately, as well as to the combined data.

.asm File Analysis

  • The .asm files were preprocessed to extract features based on count vectors for prefixes, opcodes, keywords, and registers.
  • The extracted features were used as inputs to train the KNN, LR, RF, and XGB models separately.
  • The models were trained using the preprocessed .asm file data and evaluated using appropriate evaluation metrics such as multi-class log-loss and confusion matrix.
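
The count-vector idea for .asm files can be sketched as follows. The vocabularies here are small illustrative lists and the sample lines are invented; the project's actual prefix, opcode, keyword, and register lists are not given in this README, so treat every name below as an assumption.

```python
from collections import Counter

# Small illustrative vocabularies (assumptions, not the project's lists).
PREFIXES = [".text:", ".data:", ".rdata:", "HEADER:"]
OPCODES = ["mov", "push", "pop", "call", "jmp", "retn"]
REGISTERS = ["eax", "ebx", "ecx", "edx", "esi", "edi", "esp", "ebp"]
KEYWORDS = [".dll", "std::", ":dword"]

def asm_count_features(lines):
    """Count vocabulary hits in disassembly lines and return them by name."""
    counts = Counter()
    for line in lines:
        tokens = line.replace(",", " ").split()
        if not tokens:
            continue
        # The first token looks like ".text:00401000"; match its section prefix.
        for prefix in PREFIXES:
            if tokens[0].startswith(prefix):
                counts[prefix] += 1
        # Remaining tokens are matched exactly against opcodes and registers.
        for tok in tokens[1:]:
            if tok in OPCODES or tok in REGISTERS:
                counts[tok] += 1
        # Keywords are counted once per line in which they appear.
        for kw in KEYWORDS:
            if kw in line:
                counts[kw] += 1
    vocab = PREFIXES + OPCODES + REGISTERS + KEYWORDS
    return {name: counts[name] for name in vocab}

# Invented sample lines in the style of a disassembly listing.
sample = [
    ".text:00401000 push ebp",
    ".text:00401001 mov ebp, esp",
    ".text:00401003 call ds:GetVersion",
    '.data:00403000 dd offset aKernel32_dll ; "kernel32.dll"',
]
features = asm_count_features(sample)
```

Each file yields one fixed-length vector (the dictionary values in vocabulary order), which is the shape the classifiers expect.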

.bytes File Analysis

  • The .bytes files were converted into a bag-of-words representation using hexadecimal codes.
  • The bag-of-words representation was used as input to train the KNN, LR, RF, and XGB models separately.
  • As with the .asm files, the models were evaluated using multi-class log-loss and the confusion matrix.
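
A bag-of-words over hexadecimal codes amounts to counting how often each byte value occurs in a file. The sketch below assumes each line starts with an address column followed by two-character hex tokens, and that "??" may appear as a placeholder for unreadable bytes; both are assumptions about the file layout, and the sample lines are invented.

```python
from collections import Counter

# Vocabulary: the 256 byte values "00".."FF" plus "??" (assumed placeholder).
VOCAB = [f"{i:02X}" for i in range(256)] + ["??"]

def bytes_bag_of_words(lines):
    """Return a 257-dimensional count vector for one file's hex dump lines."""
    counts = Counter()
    for line in lines:
        tokens = line.split()
        # Skip the leading address column; count the remaining byte tokens.
        counts.update(tok.upper() for tok in tokens[1:])
    return [counts[tok] for tok in VOCAB]

# Invented sample lines in the assumed "address byte byte ..." layout.
sample = [
    "00401000 56 8D 44 24 08 50 8B F1",
    "00401008 E8 1C 1B 00 00 C7 06 08",
    "00401010 ?? ?? 00 00",
]
vector = bytes_bag_of_words(sample)
```

Because the vocabulary is fixed, every file maps to a vector of the same length regardless of file size. With files this large, it is usually preferable to stream each file line by line rather than load it into memory.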

Combined Data Analysis

  • The features extracted from the .asm files and the .bytes files were combined into a single feature set.
  • The combined feature set was used as input to train the KNN, LR, RF, and XGB models.
  • The models were evaluated using multi-class log-loss and the confusion matrix.
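
Combining the two feature sets comes down to joining the per-file vectors on a shared file identifier. A minimal sketch, assuming each feature set is a dictionary keyed by file ID (the IDs and numbers below are hypothetical):

```python
def combine_features(byte_features, asm_features):
    """Concatenate per-file vectors for files present in both inputs."""
    shared_ids = sorted(set(byte_features) & set(asm_features))
    return {fid: byte_features[fid] + asm_features[fid] for fid in shared_ids}

# Hypothetical file IDs and feature vectors (not project data).
byte_features = {"fileA": [4, 2, 0], "fileB": [1, 0, 3]}
asm_features = {"fileA": [3, 1], "fileB": [0, 2], "fileC": [5, 5]}

combined = combine_features(byte_features, asm_features)
```

Files missing from either side are dropped so that every combined row has the same length. Since KNN and logistic regression are sensitive to feature scale, normalizing the columns before training is generally advisable when counts from different sources are mixed.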

🚀 Usage

  1. Open MicrosoftMalwareDetection.ipynb in Jupyter Notebook.
  2. Run the cells in the notebook to preprocess the data, perform feature engineering, and train the KNN, LR, RF, and XGB models on the .asm files and the .bytes files separately, as well as on the combined data.
  3. Evaluate the performance of each model using appropriate evaluation metrics such as multi-class log-loss and confusion matrix.
  4. Follow the instructions provided within the notebook to generate visualizations, interpret the results, and make data-driven decisions.

Machine Learning Model Performance

.bytes Files

| Model | Log Loss | Misclassified % |
| --- | --- | --- |
| Random Model | 2.45 | 88% |
| KNN | 0.24 | 4.5% |
| RF | 0.085 | 2.02% |
| XGB | 0.078 | 1.24% |
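
A random baseline gives the log-loss figures a reference point. The sketch below is an assumption about how such a baseline could be built, not the notebook's code: it compares uniform predictions, whose log-loss over nine classes is exactly ln 9 (about 2.197), with normalized random probabilities. The labels are synthetic.

```python
import math
import random

N_CLASSES = 9  # the nine malware classes in the dataset

def log_loss(y_true, probs):
    """Mean negative log-probability assigned to the true class."""
    return -sum(math.log(row[label]) for label, row in zip(y_true, probs)) / len(y_true)

def random_probabilities(n_samples, n_classes, rng):
    """Draw random positive scores per class and normalize each row to sum to 1."""
    rows = []
    for _ in range(n_samples):
        # 1 - random() lies in (0, 1], so no probability is exactly zero.
        raw = [1.0 - rng.random() for _ in range(n_classes)]
        total = sum(raw)
        rows.append([value / total for value in raw])
    return rows

rng = random.Random(0)
y_true = [rng.randrange(N_CLASSES) for _ in range(2000)]  # synthetic labels

uniform_loss = log_loss(y_true, [[1 / N_CLASSES] * N_CLASSES] * len(y_true))
random_loss = log_loss(y_true, random_probabilities(len(y_true), N_CLASSES, rng))
```

Random probabilities score worse than uniform ones on average, because log-loss penalizes the occasional very low probability on the true class. Any trained model should land well below both values.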

.asm Files

| Model | Log Loss | Misclassified % |
| --- | --- | --- |
| KNN | 0.089 | 2.02% |
| LR | 0.415 | 9.61% |
| RF | 0.057 | 1.15% |
| XGB | 0.048 | 0.87% |

Combined (.bytes and .asm) Files

| Model | Log Loss | Misclassified % |
| --- | --- | --- |
| RF | 0.04 | <1% |
| XGB | 0.031 | <1% |

🐞 Found a Bug?

If you encounter any issues or bugs while using this project, please let us know by creating an issue. Your feedback and bug reports are valuable to us, and we appreciate your contribution to improving the project.

We will do our best to address and resolve issues promptly. Thank you!

⭐️ Please Star

If you find this project useful or helpful, we kindly request you to star ⭐️ the repository. Your support is greatly appreciated!
