-
|
We haven't added citation traversal yet because, when you're working in a folder of your own PDFs, the chance of a traversed citation being there is low. That said, if we add a real paper search (not just one that searches local PDFs), then citation traversal becomes relevant. One question: if you add a paper search, how do you get the papers in the search results? Will you add scraping functionality too? If you feel like adding a paper search or citation traversal tool, feel free to open a PR, just make sure good unit tests are added. |
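For readers following along, here is a minimal sketch of what citation traversal could look like, and why a local-only folder makes it unrewarding. It is not this project's implementation: `traverse_citations` and the `get_references` callback are hypothetical names, and the toy graph stands in for whatever metadata source would supply reference lists.

```python
from collections import deque
from typing import Callable, Iterable


def traverse_citations(
    seed_ids: Iterable[str],
    get_references: Callable[[str], Iterable[str]],
    max_depth: int = 1,
) -> dict[str, int]:
    """Breadth-first walk over references; maps each work ID to its depth."""
    depths: dict[str, int] = {}
    queue: deque[tuple[str, int]] = deque()
    for seed in seed_ids:
        if seed not in depths:
            depths[seed] = 0
            queue.append((seed, 0))
    while queue:
        work_id, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not expand works at the depth limit
        for ref in get_references(work_id):
            if ref not in depths:  # visit each work once, even in cycles
                depths[ref] = depth + 1
                queue.append((ref, depth + 1))
    return depths


# Toy reference graph (hypothetical IDs): A cites B and C, B cites C and D.
graph = {"A": ["B", "C"], "B": ["C", "D"], "C": [], "D": ["A"]}
found = traverse_citations(["A"], lambda w: graph.get(w, []), max_depth=2)

# Works reached by traversal that are not in the local PDF folder.
local_pdfs = {"A", "C"}
missing = [w for w in found if w not in local_pdfs]
```

In this toy case half of the traversed works are missing locally, which is the situation described above: traversal only pays off once there is some way to fetch the works it discovers.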
-
|
Thanks for the reply! I put together a quick prototype over the weekend because I needed paper search urgently. Right now it is a small wrapper around OpenAlex with a basic PDF resolver (using the HTML or direct PDF URL). It works, but it still needs several improvements before it is really solid, for example:
- better handling of JATS, full-text XML, and structured HTML
- stronger PDF detection (redirects, cookies, multiple candidate links)
- smarter prioritization using license status, Crossref metadata, PMC links, etc.
- source-specific logic for tricky providers; for instance, for some PMC links I had to bypass a proof-of-work (PoW) gate, and I ended up using a loop with an AI agent (gpt-5-codex) that kept trying download strategies until it succeeded, then I asked it to output the final working script

Given there are around 42.5k peer-reviewed journals (STM 2018), ~22k of them full OA, plus preprint servers, indices like PubMed, and repositories like Europe PMC, I am pretty sure we can automate more of this and gradually build a growing list of supported sources, especially if the open-source community contributes.

I am going to keep improving my prototype. Do you think it makes more sense to open an early PR and iterate together, or would you rather I wait until the paper search and fetching are more complete and robust before submitting a PR? |
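To make the "multiple candidate links" and "smarter prioritization" points concrete, here is a rough sketch of ranking full-text candidates for a single work. It is not the prototype's actual code. It assumes a work record shaped like an OpenAlex work (`best_oa_location` and `locations`, each with `pdf_url`, `landing_page_url`, `is_oa`, and `license`); the function name, the score weights, and the set of "permissive" license strings are illustrative assumptions.

```python
from typing import Any

# Licenses treated as clearly reusable; this set is an illustrative assumption.
PERMISSIVE_LICENSES = {"cc-by", "cc-by-sa", "cc0", "public-domain"}


def rank_pdf_candidates(work: dict[str, Any]) -> list[str]:
    """Return candidate full-text URLs for one work, best guess first."""
    scored: list[tuple[int, int, str]] = []
    seen: set[str] = set()
    locations = [work.get("best_oa_location"), *(work.get("locations") or [])]
    for position, loc in enumerate(locations):
        if not loc:
            continue
        # Direct PDF links get a large head start over landing pages,
        # which still need HTML parsing to find the actual file.
        for key, base in (("pdf_url", 4), ("landing_page_url", 0)):
            url = loc.get(key)
            if not url or url in seen:
                continue
            seen.add(url)
            score = base
            if loc.get("is_oa"):
                score += 2
            if (loc.get("license") or "").lower() in PERMISSIVE_LICENSES:
                score += 1
            # Negated score sorts best first; position breaks ties stably.
            scored.append((-score, position, url))
    return [url for _, _, url in sorted(scored)]


# Hypothetical record with the same location listed twice and one closed source.
example_work = {
    "best_oa_location": {
        "pdf_url": "https://a.example/p.pdf",
        "landing_page_url": "https://a.example/p",
        "is_oa": True,
        "license": "cc-by",
    },
    "locations": [
        {"pdf_url": None, "landing_page_url": "https://b.example/p",
         "is_oa": False, "license": None},
        {"pdf_url": "https://a.example/p.pdf",
         "landing_page_url": "https://a.example/p",
         "is_oa": True, "license": "cc-by"},
        {"pdf_url": "https://c.example/p.pdf",
         "landing_page_url": "https://c.example/p",
         "is_oa": True, "license": None},
    ],
}
candidates = rank_pdf_candidates(example_work)
```

A downloader could then try the URLs in order and stop at the first response that is really a PDF, which keeps the source-specific fallbacks (redirects, cookies, gated providers) in a separate layer from the ranking.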
-
|
Btw, I didn't know anything about the literature/OA ecosystem before this weekend, so I'm sure I'm missing many known "tricks" or methods. |
-
|
This is just a half-baked idea, but Zotero has had a lot of work put into finding full texts, both in the native app and, using separate strategies, through a browser extension (Zotero Connector). As I recall, they have different protocols in place to avoid getting rate limited, to let the user interact with CAPTCHAs, etc. This works pretty well for retrieving full texts that an individual has access to through their own institutional subscriptions. Their implementation is written primarily in JavaScript rather than Python, but perhaps it could be useful as a reference for a basic workflow? |
-
Hello,
I understand that internally you already have a real paper search function and citation traversal. I assume the reason these aren’t open-sourced yet may be related to licensing or legal constraints.
If I implement these features myself, can I open a pull request to the main repository, or should I keep it in a separate repo?
Thanks!