paper_search with real paper search #1193

AmT42 · 2025-11-11T15:21:19Z

Hello,

I understand that internally you already have a real paper search function and citation traversal. I assume the reason these aren’t open-sourced yet may be related to licensing or legal constraints.

If I implement these features myself, can I open a pull request to the main repository, or should I keep it in a separate repo?

Thanks

jamesbraza · 2025-11-11T18:36:05Z

Maintainer

We haven't added citation traversal yet because, when you're working in a folder of your own PDFs, the chance of a traversed citation being there is low. That said, if we add a real paper search (not just one searching local PDFs), then citation traversal becomes relevant.

One question: if you add a paper search, how do you get the papers in the search results? Will you add scraping functionality too?

If you feel like adding a paper search or citation traversal tool, feel free to open a PR; just make sure good unit tests are added.

0 replies

AmT42 · 2025-11-12T20:49:43Z

Author

Thanks for the reply!

I put together a quick prototype over the weekend because I needed paper search urgently. Right now it is a small wrapper around OpenAlex with a basic PDF resolver (using the HTML or direct PDF URL). It works, but it still needs several improvements before it is really solid, for example:

better handling of JATS, full-text XML, and structured HTML

stronger PDF detection (redirects, cookies, multiple candidate links)

smarter prioritization using license status, Crossref metadata, PMC links, etc.

source-specific logic for tricky providers. For instance, some PMC links sit behind a PoW (proof-of-work) gate. To get past it, I ran an AI agent (gpt-5-codex) in a loop that kept trying download strategies until one succeeded, then asked it to output the final working script
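
The prioritization and PDF-detection items above can be sketched roughly as follows. This is a minimal illustration, not the prototype's actual code: `candidate_pdf_urls`, `looks_like_pdf`, and the `LICENSE_RANK` ordering are hypothetical, and the input is assumed to be a dict shaped like an OpenAlex work record (with `best_oa_location`, `locations`, `pdf_url`, `landing_page_url`, and `license` fields). No network access is involved.

```python
# Hypothetical preference order for license strings; unknown or missing sorts last.
LICENSE_RANK = {"cc0": 0, "cc-by": 0, "cc-by-sa": 1, "cc-by-nc": 2}


def candidate_pdf_urls(work: dict) -> list:
    """Return de-duplicated candidate URLs for a work, most promising first.

    Direct PDF links come before landing pages; within each group,
    locations with a more permissive license are tried first.
    """
    locations = []
    best = work.get("best_oa_location")
    if best:
        locations.append(best)
    locations.extend(work.get("locations") or [])

    # sorted() is stable, so ties keep their original order
    # (best_oa_location first, then the remaining locations).
    locations = sorted(
        locations,
        key=lambda loc: LICENSE_RANK.get(loc.get("license") or "", 99),
    )

    urls = []
    for field in ("pdf_url", "landing_page_url"):
        for loc in locations:
            url = loc.get(field)
            if url and url not in urls:
                urls.append(url)
    return urls


def looks_like_pdf(body: bytes) -> bool:
    """Check the payload itself rather than trusting the Content-Type header.

    Some files carry a few stray bytes before the header, so the first
    1024 bytes are scanned instead of requiring an exact prefix match.
    """
    return b"%PDF-" in body[:1024]
```

A fetch loop could walk the returned list in order and keep the first response for which `looks_like_pdf` is true, which also catches HTML interstitials served with a misleading content type.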

Given there are around 42.5k peer-reviewed journals (STM 2018), ~22k of them full OA, plus preprint servers, indices like PubMed, and repositories like Europe PMC, I am pretty sure we can automate more of this and gradually build a growing list of supported sources, especially if the open-source community contributes.
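
One way a "growing list of supported sources" could be structured is a small registry of per-host resolvers that contributors extend one provider at a time. The sketch below is an assumption about how that might look, not existing code: `register`, `resolve_pdf_url`, and `arxiv_resolver` are hypothetical names, and the arXiv rule (swapping `/abs/` for `/pdf/`) is just a simple example provider.

```python
from typing import Callable, Dict, Optional
from urllib.parse import urlparse

# Maps a hostname to a function turning a landing-page URL into a PDF URL.
Resolver = Callable[[str], Optional[str]]
_RESOLVERS: Dict[str, Resolver] = {}


def register(host: str):
    """Decorator that registers a source-specific resolver for one host."""
    def decorator(func: Resolver) -> Resolver:
        _RESOLVERS[host] = func
        return func
    return decorator


@register("arxiv.org")
def arxiv_resolver(url: str) -> Optional[str]:
    # arXiv abstract pages map to PDFs by replacing /abs/ with /pdf/.
    path = urlparse(url).path
    if path.startswith("/abs/"):
        return "https://arxiv.org/pdf/" + path[len("/abs/"):]
    return None


def resolve_pdf_url(url: str) -> Optional[str]:
    """Dispatch to the resolver registered for the URL's host, if any."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[len("www."):]
    resolver = _RESOLVERS.get(host)
    return resolver(url) if resolver else None
```

With this shape, supporting a new provider means adding one decorated function plus its unit tests, and unsupported hosts fall through to `None` so a generic resolver can take over.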

I am going to keep improving my prototype. Do you think it makes more sense to open an early PR and iterate together, or would you rather I wait until the paper search and fetching are more complete and robust before submitting a PR?

1 reply

AmT42 Nov 12, 2025
Author

Also, in my initial tests, it fetches the PDF for roughly 75% of papers that are open access and not behind a CAPTCHA.

AmT42 · 2025-11-12T20:51:41Z

Author

Btw, I didn't know anything about the literature/OA ecosystem before this weekend, so I’m sure I’m missing many known "tricks" or methods.

0 replies

conjon42 · 2025-11-20T07:41:27Z

This is just a half-baked idea, but Zotero has had a lot of work put into finding full texts, both in the native app and, using separate strategies, through a browser extension (Zotero Connector). They have, as I recall, different protocols in place to avoid getting rate limited, allow the user to interact with CAPTCHAs, etc. This works pretty well for retrieving full texts that an individual has access to through their own institution's subscriptions. Their implementation is written primarily in JavaScript rather than Python, but perhaps it could be useful as a reference for a basic workflow?
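
As a rough illustration of the rate-limiting part only, a per-host throttle in Python might look like the sketch below. It is not based on Zotero's code; `HostThrottle` and its parameters are hypothetical, and the clock and sleep functions are injectable so the behavior can be unit tested without real delays.

```python
import time
from urllib.parse import urlparse


class HostThrottle:
    """Enforce a minimum interval between requests to the same host."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = {}  # host -> earliest time of the next request

    def wait(self, url: str) -> float:
        """Block until a request to the URL's host is allowed.

        Returns the number of seconds that were slept.
        """
        host = urlparse(url).hostname or ""
        now = self._clock()
        ready_at = self._next_allowed.get(host, now)
        delay = max(0.0, ready_at - now)
        if delay > 0:
            self._sleep(delay)
        # Schedule the next slot from whichever is later: now or ready_at.
        self._next_allowed[host] = max(now, ready_at) + self.min_interval
        return delay
```

Calling `throttle.wait(url)` before each download keeps requests to one publisher spaced out while leaving requests to other hosts unaffected.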

0 replies
