Automatically import retraction metadata #62

Open
bnewbold opened this issue Oct 14, 2020 · 3 comments
Labels
content Bulk imports and updates to existing production catalog help wanted Good tasks for external contributors

Comments

@bnewbold
Member

There is a database of retracted papers at: http://retractiondatabase.org/RetractionSearch.aspx?&AspxAutoDetectCookieSupport=1

It would be good to have a bot which periodically fetches updates, and then updates article metadata in fatcat appropriately.

@bnewbold bnewbold added content Bulk imports and updates to existing production catalog help wanted Good tasks for external contributors labels Oct 14, 2020
@hs2361

hs2361 commented Oct 18, 2020

I would like to work on this. Could you provide some more details? What kind of mechanism can be used to fetch the data from their database? They have clearly mentioned that scraping the website is prohibited (https://retractionwatch.com/retraction-watch-database-user-guide/).

@bnewbold
Member Author

Ah, I didn't notice that. The services are on different domains, so I didn't realize they were the same project, but now I see the "User Guide" link.
I guess the next step would be to find alternative sources of retraction metadata with persistent identifiers (e.g., DOI or PubMed identifier). Some sources I can think of are:

  • PubMed/MEDLINE itself (we already have a parser for this; we could update the import pipeline to allow "updates" to existing entries when the publication_stage does not match or has changed to "retracted")
  • publisher-specific corpora, like SciGraph
  • heuristics, like finding publications with the title "Retraction of TITLE", then finding the prior publication from the same journal ("container") with the given title
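
As a rough illustration of the title heuristic in the last bullet, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `Release` dataclass is a stand-in and not fatcat's actual schema, the function names are hypothetical, and the regex only covers the "Retraction of TITLE" and "Retraction: TITLE" patterns. It deliberately returns nothing when the match is ambiguous, since a wrong retraction flag is worse than a missing one.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Release:
    """Hypothetical stand-in for a catalog entry (not fatcat's real schema)."""
    ident: str
    title: str
    container_id: str


# Matches "Retraction of TITLE" or "Retraction: TITLE" (assumed patterns only).
RETRACTION_TITLE = re.compile(r"^\s*retraction\s*(?:of\s+|:\s*)(.+?)\s*$", re.IGNORECASE)


def normalize(title: str) -> str:
    """Lowercase and collapse punctuation so quoting differences do not matter."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def retracted_title(notice_title: str) -> Optional[str]:
    """Extract TITLE from a retraction notice title, or None if it does not match."""
    match = RETRACTION_TITLE.match(notice_title)
    return match.group(1) if match else None


def find_retracted(notice: Release, catalog: List[Release]) -> Optional[Release]:
    """Find the single release in the same container whose title matches the notice."""
    target = retracted_title(notice.title)
    if target is None:
        return None
    wanted = normalize(target)
    matches = [
        r for r in catalog
        if r.container_id == notice.container_id
        and r.ident != notice.ident
        and normalize(r.title) == wanted
    ]
    # Only accept an unambiguous match.
    return matches[0] if len(matches) == 1 else None
```

Exact matching on normalized titles is the conservative choice here; fuzzy matching would catch more notices but would need manual review of the results.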

@bnewbold
Copy link
Member Author

bnewbold commented Feb 4, 2021

Here is an open corpus of ~100k retractions: http://openretractions.com/

"we only know about retractions and other updates that publishers have properly reported to CrossRef or PubMed. That's currently 114596 papers."

I see only a couple thousand retracted "releases" in fatcat today. We do import from Crossref and PubMed, so in theory we should have comparable numbers, but we don't run updates automatically yet, so if most of these retractions are from the past couple of years, we are probably missing them. There might also be bugs in our Crossref and PubMed importers. I don't think we have tests for that code path, so a good first contribution would be adding tests for both Crossref and PubMed retractions.
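
To give a sense of what such tests might look like, here is a small pytest-style sketch in Python. It does not use fatcat's importer classes; `crossref_retracted_dois` and `pubmed_is_retracted` are hypothetical helpers, and the fixtures are hand-written minimal fragments with made-up identifiers, not real records. It assumes retraction notices in Crossref metadata carry an `update-to` entry of type `retraction`, and that retracted PubMed articles carry a "Retracted Publication" publication type or a `RetractionIn` comment/correction; both assumptions should be checked against real sample records before being relied on.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def crossref_retracted_dois(record: Dict) -> List[str]:
    """DOIs that this Crossref record (a retraction notice) says it retracts."""
    return [
        update["DOI"].lower()
        for update in record.get("update-to", [])
        if update.get("type", "").lower() == "retraction" and "DOI" in update
    ]


def pubmed_is_retracted(article_xml: str) -> bool:
    """True if a PubmedArticle fragment is marked as retracted."""
    root = ET.fromstring(article_xml)
    pub_types = {el.text for el in root.iter("PublicationType")}
    if "Retracted Publication" in pub_types:
        return True
    return any(
        el.get("RefType") == "RetractionIn"
        for el in root.iter("CommentsCorrections")
    )


# Hand-written fixtures with made-up identifiers (not real records).
CROSSREF_NOTICE = {
    "DOI": "10.5555/notice.1",
    "update-to": [{"DOI": "10.5555/ORIGINAL.1", "type": "retraction"}],
}
PUBMED_RETRACTED = """
<PubmedArticle><MedlineCitation>
  <PMID>1</PMID>
  <Article><PublicationTypeList>
    <PublicationType>Journal Article</PublicationType>
    <PublicationType>Retracted Publication</PublicationType>
  </PublicationTypeList></Article>
</MedlineCitation></PubmedArticle>
"""
PUBMED_NORMAL = """
<PubmedArticle><MedlineCitation>
  <PMID>2</PMID>
  <Article><PublicationTypeList>
    <PublicationType>Journal Article</PublicationType>
  </PublicationTypeList></Article>
</MedlineCitation></PubmedArticle>
"""


def test_crossref_retraction_notice():
    assert crossref_retracted_dois(CROSSREF_NOTICE) == ["10.5555/original.1"]
    assert crossref_retracted_dois({"DOI": "10.5555/plain.1"}) == []


def test_pubmed_retracted_article():
    assert pubmed_is_retracted(PUBMED_RETRACTED)
    assert not pubmed_is_retracted(PUBMED_NORMAL)
```

In the real test suite, the fixtures would more usefully be actual sample records fed through the existing importers, with assertions on the resulting release entities.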
