Spam filter #64

Open
bnewbold opened this issue Nov 3, 2020 · 2 comments
Labels
enhancement (New feature or request), help wanted (Good tasks for external contributors)

Comments

@bnewbold
Contributor

bnewbold commented Nov 3, 2020

It would be useful to have a naive function that looks at release metadata and detects gratuitous spam. In theory, upstream partner sources should be able to catch spam, but, e.g., today Zenodo had more than 25,000 spam DOIs and PDFs registered:

https://fatcat.wiki/release/search?q=doi_prefix%3A10.5281+date%3A2020-11-02

Most of these have terms like [PDF], EPUB, D.O.W.N.L.O.A.D, etc., which simple statistical spam detection should be able to find. The goal wouldn't be to make something impenetrable, just to prevent large batches from getting imported. If we had such a function in one place, we could add patterns over time and reuse the function both in automated bot imports (e.g., the DataCite DOI metadata import here) and in a review bot for human edits.
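As a rough illustration of the pattern idea, here is a minimal sketch that counts how many spam-like patterns appear in a title. The function names, the pattern list, and the `min_hits` threshold are all hypothetical, not existing fatcat code; the patterns only cover the three terms mentioned above.

```python
import re

# Hypothetical starter patterns, taken from the terms mentioned in this issue.
SPAM_TITLE_PATTERNS = [
    re.compile(r"\[pdf\]", re.IGNORECASE),
    re.compile(r"\bepub\b", re.IGNORECASE),
    # "download" with at least one separator between every letter,
    # e.g. "D.O.W.N.L.O.A.D" or "d o w n l o a d" (plain "download" does not match)
    re.compile(r"d[\W_]+o[\W_]+w[\W_]+n[\W_]+l[\W_]+o[\W_]+a[\W_]+d", re.IGNORECASE),
]


def count_spam_patterns(title: str) -> int:
    """Return how many of the known patterns occur in the title."""
    return sum(1 for pattern in SPAM_TITLE_PATTERNS if pattern.search(title))


def looks_like_spam_title(title: str, min_hits: int = 1) -> bool:
    """Flag a title once it matches at least `min_hits` distinct patterns."""
    return count_spam_patterns(title) >= min_hits
```

Raising `min_hits` to 2 would likely reduce false positives on legitimate titles that merely mention a term like EPUB, at the cost of missing spam that uses only one of the terms.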

@bnewbold added the enhancement and help wanted labels on Nov 3, 2020
@AniketShahane

@bnewbold Hey, I'd like to work on this. I was thinking we could implement something extremely simple, like a naive Bayes classifier. Before that, though, I've scraped all those spam messages and created a CSV file that contains the occurrence counts of words in the spam messages. Is there a way we can make use of that here right away, and then go on to work on the naive Bayes classifier?
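For reference, a naive Bayes classifier over title words can be sketched in a few lines. This is only an illustration under assumptions: the class name and tokenizer are hypothetical, the training titles are made up, and a real version would need counts for non-spam titles as well as the scraped spam word counts.

```python
import math
import re
from collections import Counter


def tokenize(title: str) -> list:
    """Lowercase the title and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", title.lower())


class NaiveBayesTitleClassifier:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    def train(self, title: str, label: str) -> None:
        self.word_counts[label].update(tokenize(title))
        self.doc_counts[label] += 1

    def log_score(self, title: str, label: str) -> float:
        # Assumes at least one training title per label (otherwise log(0)).
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_words = sum(self.word_counts[label].values())
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)
        for token in tokenize(title):
            count = self.word_counts[label][token]
            score += math.log((count + 1) / (total_words + len(vocab)))
        return score

    def is_spam(self, title: str) -> bool:
        return self.log_score(title, "spam") > self.log_score(title, "ham")


# Made-up training titles, for illustration only.
classifier = NaiveBayesTitleClassifier()
classifier.train("[PDF] Download Free Ebook Online EPUB", "spam")
classifier.train("Read Online Free PDF EPUB Download Book", "spam")
classifier.train("Measurement of neutron star masses from pulsar timing", "ham")
classifier.train("A survey of graph neural networks for chemistry", "ham")
```

A CSV of spam word counts could in principle be loaded straight into the "spam" counter, but the classifier still needs a comparable set of counts from real release titles to compare against.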

@bnewbold
Contributor Author

One idea would be to create a function that identifies all (or almost all) of the spam DOIs, and none from a random subset of "real" releases.
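That acceptance criterion can be checked mechanically. The sketch below is a hypothetical helper (not part of fatcat) that reports what fraction of known spam titles a candidate function catches and what fraction of real titles it wrongly flags; the example titles and the toy predicate are invented.

```python
from typing import Callable, Dict, Iterable


def evaluate_filter(
    is_spam: Callable[[str], bool],
    spam_titles: Iterable[str],
    real_titles: Iterable[str],
) -> Dict[str, float]:
    """Measure recall on spam titles and the false-positive rate on real ones."""
    spam_list = list(spam_titles)
    real_list = list(real_titles)
    caught = sum(1 for title in spam_list if is_spam(title))
    false_positives = sum(1 for title in real_list if is_spam(title))
    return {
        "spam_recall": caught / len(spam_list) if spam_list else 0.0,
        "false_positive_rate": false_positives / len(real_list) if real_list else 0.0,
    }


# Toy predicate and invented titles, just to show the call shape.
report = evaluate_filter(
    lambda title: "[pdf]" in title.lower(),
    spam_titles=["[PDF] Free Book", "EPUB D.O.W.N.L.O.A.D Now"],
    real_titles=["Soil moisture dynamics in alpine meadows"],
)
```

The target described above corresponds to a spam recall near 1.0 together with a false-positive rate of 0.0 on the random sample of real releases.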

The best current place to implement this would be in the "importer" "common" file: https://github.com/internetarchive/fatcat/blob/master/python/fatcat_tools/importers/common.py#L115

This could be a new function like is_spam_release(obj: ReleaseEntity) -> bool. It should maybe be a method on the EntityImporter class if state needs to be loaded (e.g., a file of patterns). I think operating on the release object instead of on a string makes the most sense; initially the function could check only the title field, but it could get more or less complex in the future. If you implement such a function, I can wire it up with our existing importer code paths in a separate commit/PR.
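A minimal sketch of that shape, with loud assumptions: `FakeRelease` is a stand-in for ReleaseEntity that models only the title field, `SpamFilter` is a hypothetical holder for the loaded state (in practice this might live on the importer class instead), and the one-regex-per-line pattern file format is invented for the example.

```python
import io
import re
from dataclasses import dataclass
from typing import Iterable, List, Optional, Pattern


@dataclass
class FakeRelease:
    """Stand-in for ReleaseEntity; only the title field is modeled."""
    title: Optional[str] = None


class SpamFilter:
    """Holds compiled patterns so they are loaded once, not per release."""

    def __init__(self, pattern_lines: Iterable[str]):
        self.patterns: List[Pattern[str]] = []
        for line in pattern_lines:
            line = line.strip()
            # Skip blank lines and "#" comments in the pattern file.
            if line and not line.startswith("#"):
                self.patterns.append(re.compile(line, re.IGNORECASE))

    def is_spam_release(self, release) -> bool:
        """Check only the title for now; other fields could be added later."""
        title = getattr(release, "title", None)
        if not title:
            return False
        return any(pattern.search(title) for pattern in self.patterns)


# An in-memory "file" of patterns; a real one would be opened from disk.
PATTERN_FILE = io.StringIO(r"""# one regular expression per line
\[pdf\]
\bepub\b
""")
spam_filter = SpamFilter(PATTERN_FILE)
```

Taking the release object rather than a string means callers would not need to change if the check later grows beyond the title.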

Implementation should include tests. These can be unit tests of just the behavior of the one function (as opposed to integration tests), and don't need to be exhaustive (e.g., a few representative example releases are probably enough).
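Such unit tests could look roughly like the following pytest-style functions. The `is_spam_release` here is only a placeholder so the tests can run on their own, and `SimpleNamespace` objects stand in for real release entities; the example titles are invented.

```python
from types import SimpleNamespace


def is_spam_release(release) -> bool:
    """Placeholder implementation so the tests below can run standalone."""
    title = (getattr(release, "title", None) or "").lower()
    return "[pdf]" in title or "d.o.w.n.l.o.a.d" in title


def test_flags_obvious_spam():
    release = SimpleNamespace(title="[PDF] Free Book D.O.W.N.L.O.A.D")
    assert is_spam_release(release)


def test_passes_ordinary_title():
    release = SimpleNamespace(title="Thermal conductivity of graphene")
    assert not is_spam_release(release)


def test_handles_missing_title():
    release = SimpleNamespace(title=None)
    assert not is_spam_release(release)
```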
