Which dataset is used to train and test the performance? #1

Open
vietvo89 opened this issue Apr 4, 2021 · 6 comments

Comments

vietvo89 commented Apr 4, 2021

Hello

In your paper, it seems you use only EMBER for PE files, while using Common Crawl to collect benign PDF files and VirusShare for malicious ones. For the EMBER dataset, did you use its raw binaries or the extracted data provided in the GitHub repo? The public EMBER release does not contain raw binary files, but I think MalConv and your proposal need them. I am researching the attack side, so I need to train a malware detection model.

Thanks

@RaffEdwardBAH

We used the raw binary dataset for EMBER. Since we partnered with Elastic on this research, that was not too much of an issue. If your research institution can get a VirusTotal license, you can get all the raw EMBER files that way. There are also some pre-trained weights in the repo.
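To illustrate why raw binaries matter here: a byte-level model consumes the file contents directly rather than pre-extracted features. The snippet below is only a minimal sketch of that input step, not the code from this repo. The function name `bytes_to_ids`, the `max_len` parameter, and the convention of shifting byte values by one so that 0 can act as padding are all assumptions made for illustration.

```python
def bytes_to_ids(data: bytes, max_len: int, pad_id: int = 0) -> list:
    """Turn raw file bytes into a fixed-length list of integer token ids.

    Assumed convention (hypothetical, for illustration only): each byte
    value 0..255 is shifted to 1..256 so that pad_id=0 is never a real byte.
    """
    # Truncate to max_len and shift every byte value by one.
    ids = [b + 1 for b in data[:max_len]]
    # Right-pad short files up to max_len.
    ids.extend([pad_id] * (max_len - len(ids)))
    return ids


# Example: the first bytes of a PE file start with b"MZ".
example = bytes_to_ids(b"MZ\x00", max_len=5)
```

In practice you would read each sample with `open(path, "rb").read()` and feed the resulting ids to an embedding layer; feature-only releases cannot be used this way because the bytes themselves are missing.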

If you need to train from scratch, I'd recommend swapping to malware family classification. I would think the results would be highly comparable. You could use VirusShare + @seymour1's labeling project https://github.com/seymour1/label-virusshare + AVClass to get a bunch of families. The new Sophos dataset https://github.com/sophos-ai/SOREL-20M is also an option. While they do not make the benign files available, the malicious ones are, and they have some functionality/family type information available to use as well.
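A rough sketch of how the family-classification setup could be assembled once you have a hash-to-family mapping from a labeling tool. This is an assumption-laden illustration, not the exact output format of AVClass or label-virusshare: the `labels` dict, the `select_families` helper, and the `min_samples` threshold are all hypothetical names introduced here.

```python
from collections import Counter


def select_families(labels, min_samples=2):
    """Keep only families with enough samples and assign class indices.

    labels: dict mapping file hash -> family name (None if unlabeled).
    Returns (class_index, dataset), where class_index maps family -> int
    and dataset maps hash -> int for the samples that were kept.
    """
    # Count samples per family, ignoring unlabeled files.
    counts = Counter(f for f in labels.values() if f)
    # Drop rare families; sort so class indices are deterministic.
    kept = sorted(f for f, n in counts.items() if n >= min_samples)
    class_index = {f: i for i, f in enumerate(kept)}
    dataset = {h: class_index[f] for h, f in labels.items() if f in class_index}
    return class_index, dataset


# Toy input with made-up hashes; real hashes would come from your label file.
toy_labels = {"a": "zbot", "b": "zbot", "c": "emotet", "d": None}
class_index, dataset = select_families(toy_labels, min_samples=2)
```

Filtering out rare families before training keeps every class large enough to split into train and test sets; the right threshold depends on how many samples you can collect.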

vietvo89 commented Apr 6, 2021

Thanks, Raff. I just found the website for the Sophos dataset today. I think there are plenty of ways to collect malware, but it seems that the malware research community does not use a common public dataset to compare and benchmark. Instead, researchers build their own datasets, which takes time to collect and requires them to report on the reliability of the data. It is not like the vision domain, where researchers have several large and reliable public datasets. EMBER is fantastic as a large and reliable public dataset, but many studies require raw binary files, and this could limit progress.

I have access to download malware from VirusTotal, but since I only have academic access they do not allow me to query and download by hash. I found some pages like http://www.portablefreeware.com/ to download benign software manually but if I need thousands of samples, it could be a big problem.

@RaffEdwardBAH

> seems that the malware research community does not use a common public dataset to compare and benchmark.

It's not that people don't want to; it's a legal problem, and they generally can't. A good representative benign corpus has lots of executable programs that people install in different environments, develop internally, and more. But in every one of those cases, the executable is usually either: 1) a product that is sold for a fee, which the owner would not want distributed for free, or 2) an intrinsically internal tool or product, which may or may not be considered proprietary, and which the owner would not want distributed. In either case, copyright laws apply, and the data just can't be shared. It's a huge challenge within this field that has only recently started to make better progress with stuff like EMBER and SOREL-20M, but we've got a long way to go.

> I found some pages like http://www.portablefreeware.com/ to download benign software manually but if I need thousands of samples, it could be a big problem.

Unfortunately, stuff like that will not get you anywhere near the number of executables you need, nor will it produce a representative corpus that generalizes to real-world data. This is actually something I investigated in my first paper.

vietvo89 commented Apr 6, 2021

Thanks, Raff, I see your points. I just read this paper yesterday and only now realize that you are one of its authors. I was surprised that you could get so many MS Windows files. In any research, data is the first crucial step. Due to some limitations, I cannot obtain thousands of benign samples in an effective way. If you have figured out a way to collect a large number of benign samples, or if you can release your dataset for research purposes, it would greatly benefit my research. This is my PhD topic right now, and I want to work on both the attack and defense sides.

RaffEdwardBAH commented Apr 6, 2021 via email

vietvo89 commented Apr 6, 2021

:)) I see your point. Let me take a close look at your recommendation.
