Malware Detection Using Deep Learning

Dataset

The dataset used for model training and evaluation consists of 8,970 malware samples in Portable Executable (PE) format. These samples are categorized into the following malware families:

Malware Family	Sample Count
Locker	330
Mediyes	1,450
WinWebSec	4,400
Zbot	2,100
ZeroAccess	690

The dataset is divided into training and test sets using a 70/30 split and is available at the link below.

https://figshare.com/articles/dataset/Malware_Detection_PE-Based_Analysis_Using_Deep_Learning_Algorithm_Dataset/6635642

Training Set:

Locker: 231 samples
Mediyes: 1,015 samples
WinWebSec: 3,080 samples
Zbot: 1,470 samples
ZeroAccess: 483 samples

Test Set:

Locker: 99 samples
Mediyes: 435 samples
WinWebSec: 1,320 samples
Zbot: 630 samples
ZeroAccess: 207 samples

Each sample is labeled according to its malware family. The dataset is used for opcode-based feature extraction and classification tasks.
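
The per-family training and test counts listed above are consistent with applying the 70/30 split within each family. The following is only a quick arithmetic check of those numbers, not code from the repository; the names `FAMILY_COUNTS` and `split_counts` are hypothetical.

```python
# Per-family sample counts as listed in the dataset table.
FAMILY_COUNTS = {
    "Locker": 330,
    "Mediyes": 1450,
    "WinWebSec": 4400,
    "Zbot": 2100,
    "ZeroAccess": 690,
}


def split_counts(counts, train_percent=70):
    """Return {family: (train, test)} for a per-family percentage split."""
    result = {}
    for family, total in counts.items():
        # Integer arithmetic avoids floating-point rounding surprises.
        train = total * train_percent // 100
        result[family] = (train, total - train)
    return result


splits = split_counts(FAMILY_COUNTS)
for family, (train, test) in splits.items():
    print(f"{family}: {train} train / {test} test")
```

Splitting within each family keeps the class proportions of the training and test sets the same as in the full dataset, which matters here because WinWebSec alone accounts for nearly half of the samples.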

Data Preprocessing

The data preprocessing pipeline extracts operation codes (opcodes) from the malware files for further analysis and classification. The process is carried out in three main steps using UPX, Ghidra Headless, and supporting scripts.

1. Unpacking PE Files

Before the Portable Executable (PE) files are analyzed, they need to be unpacked. This is done using UPX, a widely used packer that can also decompress the files it has packed. The script upx_unpacking.sh provides an example of how to automate this process.
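
For readers who prefer Python over shell, the unpacking step might be sketched as below. This is not the content of upx_unpacking.sh; it only assumes that the `upx` binary is on the PATH and uses its `-d` (decompress) and `-o` (output file) options. The function names and directory layout are hypothetical.

```python
import subprocess
from pathlib import Path


def build_upx_command(packed: Path, out_dir: Path) -> list[str]:
    """Build a `upx -d` command that writes the unpacked file into out_dir."""
    return ["upx", "-d", "-o", str(out_dir / packed.name), str(packed)]


def unpack_all(in_dir: Path, out_dir: Path) -> None:
    """Try to unpack every regular file in in_dir (assumed layout)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for packed in sorted(in_dir.iterdir()):
        if not packed.is_file():
            continue
        # check=False: a failure on one sample should not stop the batch.
        result = subprocess.run(
            build_upx_command(packed, out_dir),
            capture_output=True,
            text=True,
            check=False,
        )
        if result.returncode != 0:
            # Assumption: a non-zero status means UPX could not unpack the file.
            print(f"skipped {packed.name}: {result.stderr.strip()}")
```

Calling `unpack_all(Path("samples"), Path("unpacked"))` would process a hypothetical `samples` directory. Samples that UPX cannot unpack are reported and skipped rather than aborting the run.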

2. Batch Processing with Ghidra Headless

To extract opcodes from the unpacked files, we utilize Ghidra Headless, the non-GUI version of Ghidra, which allows batch processing of multiple files. The script ghidra_analysis_xmlgen_and_opcodesext.sh automates this process by iterating over the files and executing the necessary extraction steps.

3. Opcode Extraction Using Ghidra Scripts

During the batch processing phase, two Java scripts are executed sequentially:

ExportToXML.java: Converts each analyzed file's architecture into an XML format.
ExtractOpcodesFromCodeSection.java: Reads the addresses from the XML file that point to the code section opcodes, extracts them, and saves the output in a text file for further use.

Both of these scripts are located in the utils folder.
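
As a rough illustration of how the two scripts can be chained in one headless run, the sketch below builds an `analyzeHeadless` command line in Python. It is not the content of ghidra_analysis_xmlgen_and_opcodesext.sh. It assumes the scripts take no arguments, and every path and the project name are placeholders. The options used (`-import`, `-scriptPath`, `-postScript`, `-deleteProject`) are standard options of Ghidra's headless analyzer.

```python
import subprocess
from pathlib import Path

# Post-analysis scripts, in the order they should run.
POST_SCRIPTS = ["ExportToXML.java", "ExtractOpcodesFromCodeSection.java"]


def build_headless_command(analyze_headless, project_dir, project_name,
                           sample, script_dir):
    """Build an analyzeHeadless command that imports one sample and
    runs both post-analysis scripts (assumed to take no arguments)."""
    cmd = [
        str(analyze_headless), str(project_dir), project_name,
        "-import", str(sample),
        "-scriptPath", str(script_dir),
    ]
    for script in POST_SCRIPTS:
        cmd += ["-postScript", script]
    # Remove the temporary project once analysis and scripts finish.
    cmd.append("-deleteProject")
    return cmd


def analyze_all(analyze_headless, project_dir, sample_dir, script_dir):
    """Run one headless analysis per unpacked sample (assumed layout)."""
    for sample in sorted(Path(sample_dir).iterdir()):
        if sample.is_file():
            cmd = build_headless_command(
                analyze_headless, project_dir, "tmp_project", sample, script_dir
            )
            subprocess.run(cmd, check=False)
```

Listing ExportToXML.java before ExtractOpcodesFromCodeSection.java matters because the second script reads the XML file produced by the first.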

This structured preprocessing ensures that extracted opcodes are correctly formatted and ready for the next phase of AI-driven malware classification.

Feature Engineering

1. Opcode Sequence Translation to Image Using SimHash

2. Opcode Frequency Vectors

3. Opcode Sequence Embeddings

4. Opcode Sequence Translation to Image Using dHash

5. LSTM Last Hidden State Translation to Image

Classifiers

1. MLP Classifiers

2. CNN Classifiers

3. LSTM Classifiers

License

This project is licensed under the MIT License. See the LICENSE file for more information.

kooroshsajadi/malware_classification_using_deep_learning
