kooroshsajadi/malware_classification_using_deep_learning

Malware Detection Using Deep Learning

Dataset

The dataset used for model training and evaluation consists of 8,970 malware samples in Portable Executable (PE) format. These samples are categorized into the following malware families:

| Malware Family | Sample Count |
|----------------|--------------|
| Locker         | 330          |
| Mediyes        | 1,450        |
| WinWebSec      | 4,400        |
| Zbot           | 2,100        |
| ZeroAccess     | 690          |

The dataset is divided into training and test sets using a 70/30 split applied to each family. It is available at the link below.

https://figshare.com/articles/dataset/Malware_Detection_PE-Based_Analysis_Using_Deep_Learning_Algorithm_Dataset/6635642

Training Set:

  • Locker: 231 samples
  • Mediyes: 1,015 samples
  • WinWebSec: 3,080 samples
  • Zbot: 1,470 samples
  • ZeroAccess: 483 samples

Test Set:

  • Locker: 99 samples
  • Mediyes: 435 samples
  • WinWebSec: 1,320 samples
  • Zbot: 630 samples
  • ZeroAccess: 207 samples
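
The per-family counts above follow directly from applying the 70/30 split to each family separately. As a quick sanity check, the arithmetic can be reproduced with a short sketch. The function name `split_counts` is hypothetical and is not part of the repository:

```python
# Per-family sample counts, as listed in the Dataset section.
FAMILY_COUNTS = {
    "Locker": 330,
    "Mediyes": 1450,
    "WinWebSec": 4400,
    "Zbot": 2100,
    "ZeroAccess": 690,
}


def split_counts(counts, train_percent=70):
    """Return (train, test) per-family counts for a per-family percentage split."""
    # Integer arithmetic avoids floating-point rounding surprises.
    train = {name: n * train_percent // 100 for name, n in counts.items()}
    test = {name: n - train[name] for name, n in counts.items()}
    return train, test


train, test = split_counts(FAMILY_COUNTS)
print(sum(FAMILY_COUNTS.values()), sum(train.values()), sum(test.values()))
# prints: 8970 6279 2691
```

The totals show that the training set holds 6,279 samples and the test set 2,691, which together account for all 8,970 samples.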

Each sample is labeled according to its malware family. The dataset is used for opcode-based feature extraction and classification tasks.

Data Preprocessing

The data preprocessing pipeline ensures the proper extraction of operation codes (opcodes) from malware files for further analysis and classification. The process is carried out in three main steps using Ghidra Headless and supporting scripts.

1. Unpacking PE Files

Before the Portable Executable (PE) files can be analyzed, they need to be unpacked. This is done with UPX, a widely used packer that can also decompress files it has packed. The script upx_unpacking.sh provides an example of how to automate this process.
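
For readers who prefer to drive UPX from Python, the following is a minimal sketch of the same idea. It is not a port of upx_unpacking.sh, whose contents are not shown here. It assumes the `upx` binary is on the PATH, and the function names and directory layout are hypothetical. It uses UPX's `-d` (decompress) and `-o` (output file) options so that the original samples are left untouched:

```python
import subprocess
from pathlib import Path


def build_upx_command(packed, unpacked):
    """Build a `upx -d` (decompress) command that writes to a separate output file."""
    return ["upx", "-d", "-o", str(unpacked), str(packed)]


def unpack_directory(src_dir, dst_dir):
    """Try to unpack every file in src_dir into dst_dir.

    Returns the names of the files for which UPX exited with a non-zero
    status, for example because the file was not packed with UPX.
    """
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    failed = []
    for packed in sorted(Path(src_dir).iterdir()):
        if not packed.is_file():
            continue
        result = subprocess.run(
            build_upx_command(packed, dst / packed.name),
            capture_output=True,
            text=True,
        )
        if result.returncode != 0:
            failed.append(packed.name)
    return failed
```

Files reported as failed may simply not be UPX-packed. How such files should be handled (copied as-is, skipped, or unpacked with another tool) is a pipeline decision this sketch does not make.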

2. Batch Processing with Ghidra Headless

To extract opcodes from the unpacked files, we utilize Ghidra Headless, the non-GUI version of Ghidra, which allows batch processing of multiple files. The script ghidra_analysis_xmlgen_and_opcodesext.sh automates this process by iterating over the files and executing the necessary extraction steps.

3. Opcode Extraction Using Ghidra Scripts

During the batch processing phase, two Ghidra scripts written in Java are executed sequentially for each file:

  • ExportToXML.java: Converts each analyzed file's architecture into an XML format.
  • ExtractOpcodesFromCodeSection.java: Reads the addresses from the XML file that point to the code section opcodes, extracts them, and saves the output in a text file for further use.

Both of these scripts are located in the utils folder.
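
The batch step can be sketched as follows. This is an illustration of how Ghidra's `analyzeHeadless` launcher is typically invoked, not a transcription of ghidra_analysis_xmlgen_and_opcodesext.sh. The launcher path, project directory, project naming scheme, and the use of `-deleteProject` are assumptions, and whether the two scripts expect extra arguments is not stated in this README:

```python
from pathlib import Path

# Script names come from this README; they run in the order listed.
POST_SCRIPTS = ["ExportToXML.java", "ExtractOpcodesFromCodeSection.java"]


def build_headless_command(analyze_headless, project_dir, project_name,
                           sample, script_dir="utils", post_scripts=POST_SCRIPTS):
    """Build one analyzeHeadless invocation that imports and analyzes a sample."""
    cmd = [
        str(analyze_headless),      # path to Ghidra's analyzeHeadless launcher (assumed)
        str(project_dir),           # directory for the temporary Ghidra project
        project_name,
        "-import", str(sample),
        "-scriptPath", str(script_dir),
    ]
    for script in post_scripts:
        # Post-scripts run after auto-analysis, in the order given here.
        cmd += ["-postScript", script]
    cmd.append("-deleteProject")    # assumption: discard the project once done
    return cmd


def build_batch(analyze_headless, project_dir, sample_dir):
    """Build one command per file in sample_dir (hypothetical flat layout)."""
    return [
        build_headless_command(analyze_headless, project_dir,
                               f"proj_{sample.name}", sample)
        for sample in sorted(Path(sample_dir).iterdir())
        if sample.is_file()
    ]
```

Each command list can be passed to `subprocess.run`. Building the commands separately from running them makes it easy to log or dry-run the batch first.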

The result of this pipeline is a text file of code-section opcodes for each analyzed sample, which serves as the input to the feature engineering and classification stages.

Feature Engineering

1. Opcode Sequence Translation to Image Using SimHash

2. Opcode Frequency Vectors
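
This README does not describe how the frequency vectors are built, so the following is only a generic sketch of the technique: count how often each opcode from a fixed vocabulary occurs in a sample and normalize by the total. The vocabulary, the lowercasing, and the choice to ignore out-of-vocabulary opcodes are all assumptions, and `opcode_frequency_vector` is a hypothetical name:

```python
from collections import Counter


def opcode_frequency_vector(opcodes, vocabulary):
    """Return relative frequencies of the vocabulary opcodes in a sequence.

    Opcodes outside the vocabulary are ignored (an assumption of this sketch).
    """
    counts = Counter(op.lower() for op in opcodes)
    total = sum(counts[op] for op in vocabulary)
    if total == 0:
        # No known opcode present: return an all-zero vector.
        return [0.0] * len(vocabulary)
    return [counts[op] / total for op in vocabulary]


# Toy example with a four-opcode vocabulary.
vocab = ["mov", "push", "call", "ret"]
sequence = ["push", "mov", "mov", "call", "ret", "MOV", "nop"]
vector = opcode_frequency_vector(sequence, vocab)
print(vector)
```

In this toy sequence, `mov` accounts for three of the six in-vocabulary opcodes, so its entry is 0.5. A fixed vocabulary order keeps the vectors comparable across samples.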

3. Opcode Sequence Embeddings

4. Opcode Sequence Translation to Image Using dHash

5. LSTM Last Hidden State Translation to Image

Classifiers

1. MLP Classifiers

2. CNN Classifiers

3. LSTM Classifiers

License

This project is licensed under the MIT License. See the LICENSE file for more information.
