bazz-066/linux-malware-dataset

BarongTrace: Linux Malware Dataset

Introduction

This dataset is a collection of ELF file behaviors recorded by Sysmon, with Cuckoo Sandbox as the malware analysis tool. It comprises 10,414 ELF malware samples and 12,370 ELF benign samples obtained from VirusShare and snap.

If you use this dataset and find it useful, please cite the following paper:

@inproceedings{pratomo2024barongtrace,
  title={BarongTrace: A Malware Event Log Dataset for Linux},
  author={Pratomo, Baskoro Adi and Kosim, Stefanus A and Studiawan, Hudan and Prabowo, Angela O},
  booktitle={International Conference on Advanced Information Networking and Applications},
  pages={48--60},
  year={2024},
  organization={Springer}
}

Architecture

[Figure: Cuckoo Sandbox Architecture]

[Figure: Cuckoo Sandbox Analysis Process]

How to Use

This repository stores the scripts that were run to produce the dataset. Before running them, the analysis machine and Cuckoo must already be installed. Once these prerequisites are met, run the following scripts in order, adjusting every file path in the code to match your system:

  1. run.sh: Submits the collected files for analysis. If Cuckoo is installed in a virtual environment, activate that environment first. Adjust the variables on lines 3-8, 12, 14, 25, and 27 to match the file paths on your system.
  2. move_result.sh: Moves the Cuckoo analysis results to the destination directory specified on line 13. The location of the Cuckoo analysis results is set on line 2, and line 3 distinguishes between malware and benign files.
  3. merge_xml.sh: Combines every 100 XML files moved by the step 2 script. Benign and malware XML files are kept separate and are merged on lines 25 and 26.
  4. convert_xml_to_csv.py: Converts the XML files into CSV files. It requires a virtual environment with the modules listed in requirement.txt.
  5. merge_csv.py: Combines every 20 CSV files produced by the previous step. It requires a virtual environment with the modules listed in requirement.txt.
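
The batching idea behind step 5 can be sketched as follows. This is not the repository's merge_csv.py; it is a minimal standard-library illustration, and the function names, the `merged_NNNN.csv` output naming, and the choice to take the union of all input headers are assumptions made for the example.

```python
import csv
from pathlib import Path


def batched(items, size):
    """Yield consecutive slices of `items`, each holding at most `size` entries."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def merge_csv_batch(paths, out_path):
    """Concatenate CSV files into one file and return the number of rows written.

    The output header is the union of all input headers (an assumption);
    columns missing from a given input file are written as empty strings.
    """
    fieldnames, rows = [], []
    for path in paths:
        with open(path, newline="", encoding="utf-8") as handle:
            reader = csv.DictReader(handle)
            for name in reader.fieldnames or []:
                if name not in fieldnames:
                    fieldnames.append(name)
            rows.extend(reader)
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


def merge_in_batches(src_dir, dst_dir, size=20):
    """Merge every `size` CSV files from src_dir into numbered files in dst_dir."""
    files = sorted(Path(src_dir).glob("*.csv"))
    Path(dst_dir).mkdir(parents=True, exist_ok=True)
    outputs = []
    for index, group in enumerate(batched(files, size)):
        out_path = Path(dst_dir) / f"merged_{index:04d}.csv"  # hypothetical naming
        merge_csv_batch(group, out_path)
        outputs.append(out_path)
    return outputs
```

Calling `merge_in_batches("csv_out", "csv_merged")` on a directory with 45 CSV files would produce three merged files holding 20, 20, and 5 inputs respectively; both directory names here are placeholders.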

Dataset

The generated dataset is located in the Dataset directory within this repository. This directory contains three subdirectories, each named after the type of dataset it holds:

  • Mapped_All Events: This directory contains the dataset mapped to the Sysmon Windows dataset without removing any events. It contains 16,555,075 rows after removing duplicate entries using pandas.
  • Mapped_Windows Event Only: This directory contains the dataset mapped to the Sysmon Windows dataset, with events removed if they were not present when the Sysmon Windows detection model was trained. It contains 12,420,236 rows after removing duplicate entries using pandas.
  • Original_Linux: This directory contains the dataset without any mapping. It contains 16,555,075 rows after removing duplicate entries using pandas.
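
The row counts above are reported after duplicate removal with pandas. As a rough way to check such a count on a single CSV file, the sketch below uses only the standard library and treats a row as a duplicate when every column value matches an earlier row, which is an assumption about how the duplicates were defined. The function name is hypothetical.

```python
import csv


def count_unique_rows(csv_path):
    """Return (total_rows, unique_rows) for one CSV file.

    A row counts as a duplicate only if all of its column values match
    an earlier row exactly (assumed definition of "duplicate entry").
    """
    seen = set()
    total = 0
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row
        for row in reader:
            total += 1
            seen.add(tuple(row))
    return total, len(seen)
```

This keeps every distinct row in memory, so it suits spot checks on individual files rather than the full multi-million-row dataset.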

| Column Name | Definition |
| --- | --- |
| Provider_Name | The provider name of the virtual machine used by Sysmon. |
| Provider_Guid | The globally unique identifier (GUID) of the virtual machine used by Sysmon. |
| EventID | The ID of the event recorded by Sysmon. |
| Version | The version of the Sysmon configuration schema. |
| Level | The label level of Sysmon. |
| Task | The task obtained from the eBPF kernel. |
| Opcode | The operation code performed. |
| Keywords | The keywords in hexadecimal form. |
| TimeCreated_SystemTime | The time at which Sysmon logged the behavior, based on the system time. |
| EventRecordID | The unique ID recorded by Sysmon, used to analyze the sequence of events. |
| Correlation | The correlation between events. |
| Execution_ProcessID | The ID of the executed process. |
| Execution_ThreadID | The ID of the executing thread. |
| Channel | The specific channel generated by Sysmon. |
| Computer | The name of the computer used. |
| Security_UserId | The user ID for security logging. |
| RuleName | The name of the configured rule. |
| UtcTime | The time the event was created, in UTC. |
| ProcessGuid | The GUID of the running process. |
| ProcessId | The ID of the running process. |
| Image | The file path of the process. |
| User | The name of the account running the process. |
| TargetFilename | The path of the target file. |
| CreationUtcTime | The time the process occurred, in UTC. |
| Hashes | The hash captured by the Sysmon driver. |
| IsExecutable | A Boolean value indicating whether the file can be executed. |
| Archived | A Boolean value indicating whether the file is stored in an archive format. |
| FileVersion | The version of the image associated with the main process. |
| Description | The description of the image associated with the main process. |
| Product | The product name of the image associated with the main process. |
| Company | The company name of the image associated with the main process. |
| OriginalFileName | The original name of the file, obtained from the portable executable header. |
| CommandLine | The arguments executed for the main process. |
| CurrentDirectory | The path of the related image without the image name. |
| LogonGuid | The GUID of the user creating the process. |
| LogonId | The ID of the user creating the process. |
| TerminalSessionId | The ID of the user session. |
| IntegrityLevel | The integrity label present in the process. |
| ParentProcessGuid | The GUID of the process that created the main process. |
| ParentProcessId | The ID of the process that created the main process. |
| ParentImage | The file path of the process that created the main process. |
| ParentCommandLine | The arguments used to execute the parent process. |
| ParentUser | The name of the account that created the parent process. |
| Protocol | The protocol used to connect to the network. |
| Initiated | Indicates whether the process initiated the TCP connection. |
| SourceIsIpv6 | A Boolean value indicating whether the source IP is IPv6. |
| SourceIp | The source IP address initiating the connection. |
| SourceHostname | The DNS name of the host initiating the connection. |
| SourcePort | The source port number. |
| SourcePortName | The name of the source port. |
| DestinationIsIpv6 | A Boolean value indicating whether the destination IP is IPv6. |
| DestinationIp | The destination IP address. |
| DestinationHostname | The DNS name of the host being contacted. |
| DestinationPort | The destination port number. |
| DestinationPortName | The name of the destination port. |
| SourceProcessGUID | The GUID of the source process opening another process. |
| SourceProcessId | The process ID used by the operating system to identify the source process opening another process. |
| SourceThreadId | The specific ID within the source process that opens another process. |
| SourceImage | The file path of the source process creating a thread in another process. |
| TargetProcessId | The process ID used by the operating system to identify the target process. |
| GrantedAccess | The access flags (bitmask) associated with the process rights requested for the target process. |
| CallTrace | The stack trace of the called process. |
| SourceUser | The name of the account running the source process. |
| TargetUser | The name of the account running the target process being accessed. |
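
Column names such as Provider_Name, TimeCreated_SystemTime, and Execution_ProcessID look like the result of flattening a Sysmon event's XML, with attributes of the System children joined to their tag by an underscore and each EventData entry keyed by its Name attribute. The sketch below illustrates that reading. It is an assumption about the layout, not the repository's convert_xml_to_csv.py; the sample event is invented, uses a placeholder GUID, and omits any XML namespace.

```python
import xml.etree.ElementTree as ET

# Invented sample event for illustration only; values are placeholders.
SAMPLE_EVENT = """
<Event>
  <System>
    <Provider Name="Linux-Sysmon" Guid="{00000000-0000-0000-0000-000000000000}"/>
    <EventID>1</EventID>
    <TimeCreated SystemTime="2023-01-01T00:00:00.000000000Z"/>
    <Execution ProcessID="1234" ThreadID="1234"/>
    <Security UserId="0"/>
  </System>
  <EventData>
    <Data Name="Image">/usr/bin/ls</Data>
    <Data Name="CommandLine">ls -la</Data>
  </EventData>
</Event>
"""


def flatten_event(xml_text):
    """Flatten one event into a dict whose keys resemble the dataset's column names."""
    root = ET.fromstring(xml_text.strip())
    row = {}
    system = root.find("System")
    if system is not None:
        for child in system:
            text = (child.text or "").strip()
            if text:
                row[child.tag] = text  # e.g. EventID
            for attr, value in child.attrib.items():
                row[f"{child.tag}_{attr}"] = value  # e.g. Provider_Name
    event_data = root.find("EventData")
    if event_data is not None:
        for data in event_data.findall("Data"):
            name = data.get("Name")
            if name:
                row[name] = (data.text or "").strip()  # e.g. Image
    return row


if __name__ == "__main__":
    for key, value in flatten_event(SAMPLE_EVENT).items():
        print(f"{key}: {value}")
```

Each resulting dict can be written as one CSV row; events that lack a given field simply leave that column empty.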
