This dataset is a collection of PE file behaviors recorded by Sysmon, with Cuckoo Sandbox used as the malware analysis tool. It comprises 10,414 PE malware samples and 12,370 PE benign samples obtained from VirusShare and snap.
If you use this dataset and find it useful, please cite the following paper:
@inproceedings{pratomo2024barongtrace,
title={BarongTrace: A Malware Event Log Dataset for Linux},
author={Pratomo, Baskoro Adi and Kosim, Stefanus A and Studiawan, Hudan and Prabowo, Angela O},
booktitle={International Conference on Advanced Information Networking and Applications},
pages={48--60},
year={2024},
organization={Springer}
}
This repository stores the scripts that were run to produce the dataset. Before running them, the analysis machine and Cuckoo must already be installed on the computer. Once these prerequisites are met, run the following scripts in order (adjust every file path in the code to match your system):
- run.sh: Submits the collected files for analysis. If Cuckoo is installed inside a virtual environment, activate that environment first. Adjust the file paths on lines 3-8, 12, 14, 25, and 27 to match your system.
- move_result.sh: Moves the Cuckoo analysis results to the destination directory set on line 13. The location of the Cuckoo analysis results is set on line 2, and line 3 distinguishes malware files from benign files.
- merge_xml.sh: Combines every 100 XML files moved by the stage 2 script (move_result.sh) into one file. Benign and malware XML files are kept separate and are merged on lines 25 and 26.
- convert_xml_to_csv.py: Converts the XML files into CSV files. It requires a virtual environment with the modules listed in requirement.txt.
- merge_csv.py: Combines every 20 CSV files produced by the previous stage. It requires a virtual environment with the modules listed in requirement.txt.
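
The column names in this dataset (for example Provider_Name, Execution_ProcessID, and Image) suggest that each Sysmon XML event is flattened into one row before the files are merged in fixed-size batches. The following is a minimal sketch of that idea, not the code in convert_xml_to_csv.py or merge_csv.py. The function names `flatten_event` and `batches`, the sample event, and its values are hypothetical and only illustrate the naming scheme.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip an XML namespace prefix such as '{uri}System' -> 'System'."""
    return tag.rsplit("}", 1)[-1]


def flatten_event(xml_text):
    """Flatten one Sysmon <Event> into a dict keyed by column name.

    Assumption: System children become 'Tag' (text) or 'Tag_Attribute'
    columns, and each EventData <Data Name="X"> becomes column 'X'.
    """
    root = ET.fromstring(xml_text)
    row = {}
    for section in root:
        name = local(section.tag)
        if name == "System":
            for child in section:
                tag = local(child.tag)
                if child.text and child.text.strip():
                    row[tag] = child.text.strip()
                for attr, value in child.attrib.items():
                    row[f"{tag}_{attr}"] = value
        elif name == "EventData":
            for data in section:
                key = data.attrib.get("Name")
                if key:
                    row[key] = (data.text or "").strip()
    return row


def batches(items, size):
    """Yield consecutive groups of at most `size` items (e.g. 20 CSV paths)."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical sample event; the values are placeholders, not dataset content.
SAMPLE = (
    '<Event><System>'
    '<Provider Name="Linux-Sysmon" Guid="{00000000-0000-0000-0000-000000000000}"/>'
    '<EventID>1</EventID>'
    '<Execution ProcessID="100" ThreadID="100"/>'
    '</System><EventData>'
    '<Data Name="Image">/usr/bin/ls</Data>'
    '</EventData></Event>'
)

row = flatten_event(SAMPLE)
groups = list(batches(list(range(5)), 2))
```

Each group returned by `batches` could then be written out as one merged file; the batch sizes of 100 (XML) and 20 (CSV) come from the script descriptions above.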
The generated dataset is located in the Dataset directory within this repository. Inside this directory, there are three subdirectories, each named after the type of dataset it holds:
- Mapped_All Events: This directory contains the dataset mapped according to the Sysmon Windows dataset, without removing any events. It contains 16,555,075 rows after duplicate entries were removed with pandas.
- Mapped_Windows Event Only: This directory contains the dataset mapped according to the Sysmon Windows dataset, excluding events that were not present when the Sysmon Windows detection model was trained. It contains 12,420,236 rows after duplicate entries were removed with pandas.
- Original_Linux: This directory contains the dataset without any mapping. It contains 16,555,075 rows after duplicate entries were removed with pandas.
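
The row counts above are reported after duplicate removal with pandas, and the Windows-only variant additionally drops events the detection model was not trained on. The sketch below shows one way this could be done. It is an assumption about the approach, not the repository's code; the helper names and the set of kept event IDs (`{1, 3}`) are hypothetical, and the demo frame is made-up data.

```python
import pandas as pd


def deduplicate(frame):
    """Drop rows that are identical in every column."""
    return frame.drop_duplicates().reset_index(drop=True)


def keep_known_events(frame, known_event_ids):
    """Keep only rows whose EventID is in the given (hypothetical) set."""
    mask = frame["EventID"].isin(known_event_ids)
    return frame[mask].reset_index(drop=True)


# Made-up demo rows, not dataset content.
demo = pd.DataFrame(
    {
        "EventID": [1, 1, 3, 23],
        "Image": ["/bin/a", "/bin/a", "/bin/b", "/bin/c"],
    }
)

unique_rows = deduplicate(demo)
windows_only = keep_known_events(unique_rows, {1, 3})
```

Removing duplicates before filtering keeps the two steps independent, so the same deduplicated frame can feed both the all-events and the Windows-only variants.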
Column Name | Definition |
---|---|
Provider_Name | The provider name of the virtual machine used by Sysmon. |
Provider_Guid | The global unique identifier of the virtual machine used by Sysmon. |
EventID | The ID of the event recorded by Sysmon. |
Version | The version of the Sysmon configuration schema. |
Level | The label level of Sysmon. |
Task | The task obtained from the eBPF kernel. |
Opcode | The operation code performed. |
Keywords | The keywords in hexadecimal form. |
TimeCreated_SystemTime | The initial time recorded when Sysmon logs behavior with reference to the system time. |
EventRecordID | The unique ID recorded by Sysmon to analyze the sequence of events that occurred. |
Correlation | The correlation between events. |
Execution_ProcessID | The ID of the executed process. |
Execution_ThreadID | The ID of the executing thread. |
Channel | The specific channel generated by Sysmon. |
Computer | The name of the computer used. |
Security_UserId | The user ID for security logging. |
RuleName | The name of the configured rule. |
UtcTime | The time the event was created in UTC. |
ProcessGuid | The Global Unique Identifier of the process associated with the event. |
ProcessId | The ID of the process associated with the event. |
Image | The filepath of the process. |
User | The name of the account running the process. |
TargetFilename | The path of the target file. |
CreationUtcTime | The creation time of the file in UTC. |
Hashes | The hash captured by the Sysmon driver. |
IsExecutable | A Boolean value indicating whether the file can be executed or not. |
Archived | A Boolean value indicating whether the file is stored in an archive format. |
FileVersion | The version of the image associated with the main process. |
Description | The description of the image associated with the main process. |
Product | The product name of the image associated with the main process. |
Company | The company name of the image associated with the main process. |
OriginalFileName | The original name of the file obtained from the portable executable header. |
CommandLine | The arguments executed for the main process. |
CurrentDirectory | The path of the related image without the image name. |
LogonGuid | The Global Unique Identifier of the user creating the process. |
LogonId | The ID of the user creating the process. |
TerminalSessionId | The ID of the user session. |
IntegrityLevel | The integrity label present in the process. |
ParentProcessGuid | The Global Unique Identifier of the process creating the main process. |
ParentProcessId | The ID of the process creating the main process. |
ParentImage | The file path of the process that created the main process. |
ParentCommandLine | The arguments used for execution related to the parent process. |
ParentUser | The name of the account creating the parent process. |
Protocol | The protocol used to connect to the network. |
Initiated | Indicates whether the process initiated the TCP connection. |
SourceIsIpv6 | A Boolean value indicating whether the source IP is IPv6 or not. |
SourceIp | The source IP address initiating the connection. |
SourceHostname | The DNS name of the host initiating the connection. |
SourcePort | The number of the source port. |
SourcePortName | The name of the source port. |
DestinationIsIpv6 | A Boolean value indicating whether the destination IP is IPv6 or not. |
DestinationIp | The destination IP address. |
DestinationHostname | The DNS name of the host being contacted. |
DestinationPort | The number of the destination port. |
DestinationPortName | The name of the destination port. |
SourceProcessGUID | The Global Unique Identifier of the source process opening another process. |
SourceProcessId | The process ID used by the operating system to identify the source process opening another process. |
SourceThreadId | The specific ID within the source process that opens another process. |
SourceImage | The filepath of the source process creating a thread in another process. |
TargetProcessId | The process ID used by the operating system to identify the target process. |
GrantedAccess | The access flags (bitmask) associated with the rights the target process requested. |
CallTrace | The stack trace of the called process. |
SourceUser | The name of the account running the source process. |
TargetUser | The name of the account running the target process being accessed. |
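
As an illustration of how the ProcessGuid and ParentProcessGuid columns relate, the sketch below rebuilds parent-to-child links from rows and lists every descendant of a process. It assumes rows are plain dictionaries such as those produced by `csv.DictReader`; the function names and the GUID values are hypothetical placeholders.

```python
from collections import defaultdict


def build_process_tree(rows):
    """Map each ParentProcessGuid to the list of ProcessGuid values it created."""
    children = defaultdict(list)
    for row in rows:
        parent = row.get("ParentProcessGuid")
        child = row.get("ProcessGuid")
        # Rows without both GUIDs (e.g. non process-creation events) are skipped.
        if parent and child and child not in children[parent]:
            children[parent].append(child)
    return dict(children)


def descendants(tree, root):
    """Return every process reachable from `root` by following child links."""
    found = []
    seen = {root}
    stack = [root]
    while stack:
        node = stack.pop()
        for child in tree.get(node, []):
            if child not in seen:
                seen.add(child)
                found.append(child)
                stack.append(child)
    return found


# Placeholder rows, not dataset content.
demo_rows = [
    {"ProcessGuid": "{b}", "ParentProcessGuid": "{a}"},
    {"ProcessGuid": "{c}", "ParentProcessGuid": "{b}"},
    {"ProcessGuid": "{d}", "ParentProcessGuid": "{a}"},
    {"ProcessGuid": "{e}", "ParentProcessGuid": ""},
]

tree = build_process_tree(demo_rows)
all_children = descendants(tree, "{a}")
```

GUIDs are used here rather than ProcessId because process IDs can be reused by the operating system, while the GUID columns are described above as unique identifiers.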