It would be useful to have a naive function that looks at release metadata and detects gratuitous spam. In theory, upstream partner sources should be able to catch spam, but, e.g., today Zenodo had more than 25,000 spam DOIs and PDFs registered: https://fatcat.wiki/release/search?q=doi_prefix%3A10.5281+date%3A2020-11-02
Most of these have terms like [PDF], EPUB, D.O.W.N.L.O.A.D, etc., which seem like something simple statistical spam detection could find. The goal wouldn't be to make something impenetrable, just to prevent large batches from getting imported. If we had such a function in one place, we could add additional patterns over time, and reuse the function in both automated bot imports (e.g., DataCite DOI metadata here) and in a review bot for human edits.
@bnewbold Hey, I'd like to work on this. I was thinking we could implement something extremely simple, like a naive Bayes classifier. Before that, though, I've scraped all of those spam titles and created a CSV file that contains the occurrence counts of words in them. Is there a way we can make use of that right away, and then go on to work on the naive Bayes classifier?
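As a rough illustration of how a word-occurrence CSV like that could be used directly, here is a minimal sketch. Everything in it is an assumption for the sake of the example: the two-column `word,count` layout, the helper names `load_spam_terms` and `spam_term_fraction`, and the `min_count` cutoff are hypothetical and not taken from the actual scraped file.

```python
import csv
import io
import re


def load_spam_terms(csv_text, min_count=100):
    """Return lowercase terms seen at least `min_count` times in spam titles.

    Assumes (hypothetically) rows of the form: word,count
    """
    terms = set()
    for row in csv.reader(io.StringIO(csv_text)):
        # Skip the header row and anything malformed
        if len(row) != 2 or not row[1].strip().isdigit():
            continue
        if int(row[1]) >= min_count:
            terms.add(row[0].strip().lower())
    return terms


def spam_term_fraction(title, spam_terms):
    """Fraction of the title's alphanumeric tokens that are known spam terms."""
    tokens = re.findall(r"[a-z0-9]+", title.lower())
    if not tokens:
        return 0.0
    return sum(token in spam_terms for token in tokens) / len(tokens)
```

One caveat with this approach: counts from spam alone will also rank ordinary words highly, so a real naive Bayes classifier would additionally need word counts from non-spam titles to compare against.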
A new function like is_spam_release(obj: ReleaseEntity) -> bool. It should maybe be a method on the EntityImporter class if state needs to be loaded (e.g., a file of patterns). I think operating on the release object instead of on a string makes the most sense; initially the function could check only the title field, but it could get more or less complex in the future. If you implement such a function, I can wire it up with our existing importer code paths in a separate commit/PR.
The implementation should include tests. These can be unit tests of just the behavior of the one function (as opposed to integration tests), and don't need to be exhaustive (e.g., just a few representative example releases are probably enough).
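A minimal sketch of what such a function and its unit tests might look like, checking only the title field. The `ReleaseStub` dataclass is a hypothetical stand-in for `ReleaseEntity` so that the example is self-contained, and the pattern list and `min_hits` threshold are illustrative assumptions based on the terms mentioned in this issue, not a tested rule set.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReleaseStub:
    """Hypothetical stand-in for ReleaseEntity; only `title` is used here."""
    title: Optional[str] = None


# Illustrative patterns only; more could be added over time.
SPAM_TITLE_PATTERNS = [
    re.compile(r"\[\s*pdf\s*\]", re.IGNORECASE),
    re.compile(r"\bepub\b", re.IGNORECASE),
    # Letters of "download" separated by punctuation or spaces
    re.compile(r"d\W+o\W+w\W+n\W+l\W+o\W+a\W+d", re.IGNORECASE),
]


def is_spam_release(obj, min_hits=2) -> bool:
    """Return True if the title matches at least `min_hits` spam patterns."""
    title = getattr(obj, "title", None) or ""
    hits = sum(1 for pattern in SPAM_TITLE_PATTERNS if pattern.search(title))
    return hits >= min_hits


def test_is_spam_release():
    spam = ReleaseStub(title="[PDF] EPUB D.O.W.N.L.O.A.D Some Book Title")
    ham = ReleaseStub(title="A study of sparse matrix factorization")
    one_hit = ReleaseStub(title="Evaluating the EPUB format for accessibility")
    assert is_spam_release(spam)
    assert not is_spam_release(ham)
    assert not is_spam_release(one_hit)
    assert not is_spam_release(ReleaseStub(title=None))
```

Requiring two pattern hits rather than one is a design choice to keep false positives low on legitimate titles that happen to mention a single term such as EPUB; the right threshold would need checking against real data.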