The dataset used for model training and evaluation consists of 8,970 malware samples in Portable Executable (PE) format. These samples are categorized into the following malware families:
| Malware Family | Sample Count |
|---|---|
| Locker | 330 |
| Mediyes | 1,450 |
| WinWebSec | 4,400 |
| Zbot | 2,100 |
| ZeroAccess | 690 |
The dataset is divided into training and test sets using a 70/30 split applied to each family, and is available at the link below.
Training Set:
- Locker: 231 samples
- Mediyes: 1,015 samples
- WinWebSec: 3,080 samples
- Zbot: 1,470 samples
- ZeroAccess: 483 samples
Test Set:
- Locker: 99 samples
- Mediyes: 435 samples
- WinWebSec: 1,320 samples
- Zbot: 630 samples
- ZeroAccess: 207 samples
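
The per-family counts above can be reproduced from the totals in the table. The following is a minimal sketch, assuming the 70/30 split is applied to each family separately; the function name `split_counts` is illustrative and not part of the repository.

```python
# Per-family totals taken from the dataset table.
FAMILY_COUNTS = {
    "Locker": 330,
    "Mediyes": 1450,
    "WinWebSec": 4400,
    "Zbot": 2100,
    "ZeroAccess": 690,
}


def split_counts(counts, train_percent=70):
    """Return {family: (train, test)} for a per-family split.

    Integer arithmetic is used so the result is not affected by
    floating-point rounding.
    """
    result = {}
    for family, total in counts.items():
        train = total * train_percent // 100
        result[family] = (train, total - train)
    return result


splits = split_counts(FAMILY_COUNTS)
for family, (train, test) in splits.items():
    print(f"{family}: {train} train / {test} test")

total_train = sum(train for train, _ in splits.values())
total_test = sum(test for _, test in splits.values())
print(f"Total: {total_train} train / {total_test} test")
```

Splitting per family keeps the class proportions identical in both sets, which matters here because WinWebSec is more than ten times larger than Locker.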
Each sample is labeled according to its malware family. The dataset is used for opcode-based feature extraction and classification tasks.
The data preprocessing pipeline ensures the proper extraction of operation codes (opcodes) from malware files for further analysis and classification. The process is carried out in three main steps using Ghidra Headless and supporting scripts.
Before the Portable Executable (PE) files can be analyzed, packed samples need to be unpacked. This is done using UPX, a widely used executable packer that can also decompress the files it has packed. The script upx_unpacking.sh provides an example of how to automate this process.
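
As an illustration of the unpacking step, here is a minimal Python sketch that builds and runs a UPX decompression command for every file in a directory. It is not a copy of upx_unpacking.sh; the function names and directory layout are assumptions. It relies only on the `upx -d` (decompress) and `-o` (output file) options and expects `upx` to be on the `PATH`.

```python
import subprocess
from pathlib import Path


def build_upx_command(packed, unpacked):
    """Build the argument list for decompressing one file with UPX.

    `-d` decompresses and `-o` writes the result to a separate file,
    so the original sample is left untouched.
    """
    return ["upx", "-d", "-o", str(unpacked), str(packed)]


def unpack_directory(src_dir, dst_dir):
    """Try to unpack every file in src_dir into dst_dir.

    Returns the files UPX could not unpack (non-zero exit status),
    which typically includes samples that were not packed with UPX.
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for packed in sorted(src_dir.iterdir()):
        if not packed.is_file():
            continue
        cmd = build_upx_command(packed, dst_dir / packed.name)
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            failed.append(packed)
    return failed
```

Calling `unpack_directory("samples/packed", "samples/unpacked")` (hypothetical paths) returns the list of files that still need attention, for example samples protected by a different packer.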
To extract opcodes from the unpacked files, we use Ghidra Headless, the non-GUI mode of Ghidra, which allows batch processing of multiple files. The script ghidra_analysis_xmlgen_and_opcodesext.sh automates this process by iterating over the files and executing the necessary extraction steps.
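
The batch invocation can be sketched as follows. This is a hedged illustration of how an `analyzeHeadless` command line might be assembled for one sample, not the contents of ghidra_analysis_xmlgen_and_opcodesext.sh. The installation path, project name, and helper name are assumptions; the options used (`-import`, `-scriptPath`, `-postScript`, `-deleteProject`) are standard Ghidra Headless options.

```python
from pathlib import Path


def build_headless_command(ghidra_home, project_dir, project_name,
                           sample, script_dir, post_scripts):
    """Build an analyzeHeadless argument list for a single sample.

    Post-scripts run in the order given, after Ghidra's auto-analysis.
    `-deleteProject` discards the temporary project once they finish.
    """
    cmd = [
        str(Path(ghidra_home) / "support" / "analyzeHeadless"),
        str(project_dir),
        project_name,
        "-import", str(sample),
        "-scriptPath", str(script_dir),
    ]
    for script in post_scripts:
        cmd += ["-postScript", script]
    cmd.append("-deleteProject")
    return cmd


# Hypothetical paths; adjust to the local Ghidra installation and layout.
example = build_headless_command(
    ghidra_home="/opt/ghidra",
    project_dir="/tmp/ghidra_projects",
    project_name="opcode_extraction",
    sample="samples/unpacked/sample.exe",
    script_dir="utils",
    post_scripts=["ExportToXML.java", "ExtractOpcodesFromCodeSection.java"],
)
print(" ".join(example))
```

Passing the resulting list to `subprocess.run` inside a loop over the unpacked samples would give the same one-file-at-a-time batch behavior the shell script provides.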
During the batch processing phase, two Java scripts are executed sequentially:
- ExportToXML.java: Converts each analyzed file's architecture into an XML format.
- ExtractOpcodesFromCodeSection.java: Reads the addresses from the XML file that point to the code section opcodes, extracts them, and saves the output in a text file for further use.
Both of these scripts are located in the utils folder.
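
To make the role of the XML file more concrete, the sketch below shows one way the code-section address range could be read from it. It assumes the export lists memory sections as `MEMORY_SECTION` elements with `NAME`, `START_ADDR`, `LENGTH`, and `PERMISSIONS` attributes; verify this against an actual export before relying on it. The sample XML is made up for illustration, and the repository's Java script may locate the code section differently.

```python
import xml.etree.ElementTree as ET

# Illustrative XML only; a real export contains many more elements.
SAMPLE_XML = """<PROGRAM NAME="sample.exe">
  <MEMORY_MAP>
    <MEMORY_SECTION NAME=".text" START_ADDR="00401000" LENGTH="0x2000" PERMISSIONS="rx" />
    <MEMORY_SECTION NAME=".data" START_ADDR="00403000" LENGTH="0x1000" PERMISSIONS="rw" />
  </MEMORY_MAP>
</PROGRAM>"""


def executable_ranges(xml_text):
    """Return (name, first_address, last_address) for executable sections.

    Assumes START_ADDR is a plain hexadecimal string and LENGTH is a
    number with an optional 0x prefix.
    """
    root = ET.fromstring(xml_text)
    ranges = []
    for section in root.iter("MEMORY_SECTION"):
        if "x" not in section.get("PERMISSIONS", ""):
            continue  # skip sections that are not executable
        start = int(section.get("START_ADDR"), 16)
        length = int(section.get("LENGTH"), 0)
        ranges.append((section.get("NAME"), start, start + length - 1))
    return ranges


for name, first, last in executable_ranges(SAMPLE_XML):
    print(f"{name}: {first:#010x} - {last:#010x}")
```

Only the instructions inside these address ranges would then be visited when collecting opcodes, which keeps data sections out of the extracted sequence.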
This structured preprocessing ensures that extracted opcodes are correctly formatted and ready for the next phase of AI-driven malware classification.
5. LSTM Last Hidden State Translation to Image
This project is licensed under the MIT License. See the LICENSE file for more information.