Spam filter #64

bnewbold · 2020-11-03T17:44:22Z

It would be useful to have a naive function that looks at release metadata and detects gratuitous spam. In theory upstream partner sources should be able to catch spam, but, eg, today Zenodo had more than 25,000 spam DOIs and PDFs registered:

https://fatcat.wiki/release/search?q=doi_prefix%3A10.5281+date%3A2020-11-02

Most of these have terms like [PDF], EPUB, D.O.W.N.L.O.A.D, etc, which simple statistical spam detection should be able to find. The goal wouldn't be to make something impenetrable, just to prevent large batches from getting imported. If we had such a function in one place, we could add additional patterns over time, and reuse the function in both automated bot imports (eg, the datacite DOI metadata import, as here) and in a review bot for human edits.
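To make the idea concrete, here is a rough sketch of pattern-based title matching. The pattern list and the function names are placeholders made up for illustration, seeded only with the three terms mentioned above; none of this exists in fatcat.

```python
import re

# Hypothetical starter patterns, taken from the terms mentioned above.
# More patterns could be appended to this one list over time.
SPAM_TITLE_PATTERNS = [
    re.compile(r"\[PDF\]", re.IGNORECASE),
    re.compile(r"\bEPUB\b", re.IGNORECASE),
    # Five or more capital letters separated by dots, e.g. "D.O.W.N.L.O.A.D"
    re.compile(r"\b(?:[A-Z]\.){4,}[A-Z]\b"),
]


def count_spam_hits(title: str) -> int:
    """Return how many distinct spam patterns match the title."""
    return sum(1 for pattern in SPAM_TITLE_PATTERNS if pattern.search(title))


def looks_like_spam_title(title: str, min_hits: int = 1) -> bool:
    """Flag a title once it matches at least `min_hits` patterns."""
    return count_spam_hits(title) >= min_hits
```

With `min_hits=1` a legitimate title that merely mentions EPUB would be flagged, so the threshold and patterns would need checking against a sample of real releases.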

AniketShahane · 2021-02-21T15:12:51Z

@bnewbold Hey, I'd like to work on this. I was thinking we could implement something extremely simple, like a naive Bayes classifier. Before that, though, I've scraped all those spam messages and created a CSV file that contains the occurrences of words in the spam messages. Is there a way we can make use of that here right away, and then go on to work on the naive Bayes classifier?
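For reference, a naive Bayes classifier over title words can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the class name is made up, and the training titles are passed in as lists rather than read from the CSV mentioned above. Note that it needs word counts from non-spam titles as well as spam ones.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayesTitleClassifier:
    """Hypothetical multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self, spam_titles: List[str], ham_titles: List[str]):
        self.spam_counts = Counter(w for t in spam_titles for w in tokenize(t))
        self.ham_counts = Counter(w for t in ham_titles for w in tokenize(t))
        self.vocab = set(self.spam_counts) | set(self.ham_counts)
        self.spam_total = sum(self.spam_counts.values())
        self.ham_total = sum(self.ham_counts.values())
        # Log ratio of class priors, estimated from the training set sizes.
        self.log_prior = math.log(len(spam_titles) / len(ham_titles))

    def log_odds(self, title: str) -> float:
        """Log of P(spam | title) / P(ham | title); positive means spam."""
        vocab_size = len(self.vocab)
        score = self.log_prior
        for word in tokenize(title):
            if word not in self.vocab:
                continue  # words never seen in training carry no evidence
            p_spam = (self.spam_counts[word] + 1) / (self.spam_total + vocab_size)
            p_ham = (self.ham_counts[word] + 1) / (self.ham_total + vocab_size)
            score += math.log(p_spam / p_ham)
        return score

    def is_spam(self, title: str) -> bool:
        return self.log_odds(title) > 0
```

Usage would look like `NaiveBayesTitleClassifier(spam_titles, ham_titles).is_spam(title)`, where both lists are samples of titles already known to be spam or real.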

bnewbold · 2021-02-23T03:07:30Z

One idea would be to create a function that identifies all (or almost all) of the spam DOIs, and none from a random subset of "real" releases.

The best current place to implement this would be in the "importer" "common" file: https://github.com/internetarchive/fatcat/blob/master/python/fatcat_tools/importers/common.py#L115

A new function like is_spam_release(obj: ReleaseEntity) -> bool. It should maybe be a method on the EntityImporter class if state needs to be loaded (eg, a file of patterns). I think operating on the release object instead of on a string makes the most sense; initially the function could check only the title field, but it could get more or less complex in the future. If you implement such a function, I can wire it up with our existing importer code paths in a separate commit/PR.
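A possible shape for this, as a sketch only: the class below is a hypothetical stand-in that holds the loaded pattern state, and `ReleaseStub` is a stand-in for ReleaseEntity that models just the title field. How this would actually attach to EntityImporter is left open.

```python
import re
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class ReleaseStub:
    """Stand-in for fatcat's ReleaseEntity; only the title field is modeled."""
    title: Optional[str] = None


class SpamChecker:
    """Hypothetical holder for spam pattern state (not the real EntityImporter)."""

    def __init__(self, pattern_lines: Iterable[str]):
        # One regex per line, e.g. read from a patterns file; blank lines
        # and lines starting with '#' are skipped.
        stripped = (line.strip() for line in pattern_lines)
        self.spam_patterns = [
            re.compile(line, re.IGNORECASE)
            for line in stripped
            if line and not line.startswith("#")
        ]

    def is_spam_release(self, release) -> bool:
        """Check only the title field for now; a missing title is not spam."""
        title = release.title or ""
        return any(pattern.search(title) for pattern in self.spam_patterns)
```

Because the check takes the whole release object, other fields could be inspected later without changing callers.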

The implementation should include tests. These can be unit tests of just the behavior of the one function (as opposed to integration tests), and don't need to be exhaustive (eg, just a few representative example releases are probably enough).
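As an illustration of the scale of tests meant here, something like the following would probably do. The `is_spam_release` body is a throwaway placeholder so the example runs on its own, `SimpleNamespace` stands in for ReleaseEntity, and the titles are made-up examples rather than real records.

```python
from types import SimpleNamespace


def is_spam_release(release) -> bool:
    """Placeholder under test; the real function would live in the importer code."""
    title = (release.title or "").lower()
    return "[pdf]" in title or "d.o.w.n.l.o.a.d" in title


def make_release(title):
    # SimpleNamespace stands in for a ReleaseEntity with only a title set.
    return SimpleNamespace(title=title)


def test_spam_titles_are_flagged():
    assert is_spam_release(make_release("[PDF] Some Bestseller Free"))
    assert is_spam_release(make_release("D.O.W.N.L.O.A.D Some Bestseller"))


def test_real_titles_are_not_flagged():
    assert not is_spam_release(make_release("A survey of graph neural networks"))


def test_missing_title_is_not_spam():
    assert not is_spam_release(make_release(None))
```

Functions named `test_*` like these are picked up by pytest, and they can also be called directly.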

bnewbold added the "enhancement" and "help wanted" labels on Nov 3, 2020
