
September 2024 release #362

Merged: 55 commits into main on Sep 11, 2024
Conversation

@whitead (Collaborator) commented Sep 11, 2024

This comment is a draft of the release notes below (this will be a merge commit)

New Features

  • Automatic metadata population: PDF metadata is automatically retrieved from a variety of providers, adding BibTeX entries, citation counts, journal quality assessments, and retraction notices
  • Full-text search: a major difference between our published work and this repo has been the ability to search over all of the scientific literature. We've brought the OSS version closer by adding full-text keyword search via tantivy. You can now index and keyword-search many papers before embedding, making it feasible to ingest large corpora
  • Unified settings management: you can now save and load settings, which makes it easier for us to distribute settings tailored to various PaperQA2 tasks, such as writing Wikipedia articles, identifying contradictions, and extracting structured data
  • CLI: we've added a CLI that uses persistent parsings/indexes, making it much easier to ask questions of a folder of PDFs
  • LiteLLM: we've adopted LiteLLM as the LLM wrapper of choice, so many LLM APIs are now supported directly by changing only the model string. It also gives us "routers" that can handle fallbacks, API rate limiting, and retries (see the sketch after this list)
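
To give a flavor of how the new settings and LiteLLM model strings fit together, here is a minimal sketch assuming the `Settings`/`ask` interface shipped with this release; the question, model string, and paper directory are illustrative placeholders, not canonical values:

```python
from paperqa import Settings, ask

# Illustrative only: any LiteLLM-supported model string should work here, and
# paper_directory should point at a local folder of PDFs to be indexed.
answer = ask(
    "What manufacturing challenges are unique to bioplastics?",
    settings=Settings(llm="gpt-4o-mini", paper_directory="my_papers"),
)
```

The same `Settings` object can be serialized and reloaded, which is what lets us distribute task-specific settings bundles.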

Improvements

  • More modern agent frameworks
  • Reduction in dependencies
  • Removed code duplicated by litellm
  • Many improvements on code style and best practices

Regressions/Deprecation

We've removed the following features to keep our library focused:

  • doc_match - we do not have enough data to show that this method actually helps for very large corpora
  • LangchainVectorStore - we no longer support more complex vector stores (such as FAISS) via LangChain; only NumPy vector stores remain. We never found the very-large-vector-store paradigm to outperform keyword search -> vector search -> LLM reranking, so we removed the code (a sketch of the remaining path follows this list)
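
For reference, a hedged sketch of the retrieval path that remains after this removal, assuming the `Docs` class keeps its `add`/`query` methods and defaults to the built-in NumPy vector store; the file path is a placeholder:

```python
from paperqa import Docs

# Assumed behavior: Docs defaults to the in-memory NumPy vector store, so no
# external vector database is needed. The PDF path below is a placeholder.
docs = Docs()
docs.add("my_papers/example.pdf")
answer = docs.query("What did this paper conclude?")
```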

Detailed Changes:

New Contributors

Full Changelog: v4.9.0...vnew

mskarlin and others added 30 commits on August 29, 2024
* add agentic workflow module and cli

* update pyproject.toml and split google into its own import

* remove unused import

* update default search chain

* Update __init__.py

* Update search.py

* move reqs into pyproject.toml, addressing PR comments

* rename search_chain to openai_get_search_query

* remove get_llm_name

* move table_formatter to helpers.py

* Update paperqa/agents/main.py

Co-authored-by: James Braza <[email protected]>

* Update paperqa/agents/main.py

Co-authored-by: James Braza <[email protected]>

* Update paperqa/agents/main.py

Co-authored-by: James Braza <[email protected]>

* Update paperqa/agents/docs.py

Co-authored-by: James Braza <[email protected]>

* Update paperqa/types.py

Co-authored-by: James Braza <[email protected]>

* rename compute_cost to compute_total_model_token_cost

* remove stream_answer

* rename to stub_manifest, and use Path for all paths

* Update paperqa/llms.py

Co-authored-by: James Braza <[email protected]>

* move SKIP_AGENT_TESTS = False

* nix _ = assignments

* add test comments

* types in conftest.py

* split libs into llms

* link openai chat timeout to query.timeout

* Update paperqa/agents/__init__.py

Co-authored-by: James Braza <[email protected]>

* logging revamp and renaming

* Update tests/test_cli.py

Co-authored-by: James Braza <[email protected]>

* Update tests/test_cli.py

Co-authored-by: James Braza <[email protected]>

* move vertex import to func call, add docstring to SupportsPickle

* docstring

* remove _ =

* remove bool return type from set

* update gitignore

* add config attribute to base LLMModel class

* replace get_current_settings -> get_settings

* replace get_current_settings -> get_settings

* PR simplifications

* remove all stream_* functions

* avoid modifying the root logger

* re-organize logger import location

* move hashlib into utils

* refactor strip_answer into Answer object

* label circular imports

* ensure absolute paths are used in index name

* limit select to be used only when DOI is not present in crossref

* Update paperqa/agents/search.py

Co-authored-by: James Braza <[email protected]>

* Update paperqa/agents/search.py

Co-authored-by: James Braza <[email protected]>

* Update paperqa/agents/search.py

Co-authored-by: James Braza <[email protected]>

* Update paperqa/agents/search.py

Co-authored-by: James Braza <[email protected]>

* Update paperqa/agents/models.py

Co-authored-by: James Braza <[email protected]>

* reconfigure logging to not prevent propagation

* remove newlines in the current year

* use required fields as a subset

* replace . with Path.cwd()

---------

Co-authored-by: James Braza <[email protected]>
* add unpaywall provider

* remove unused clean query method, update test cassettes to use [email protected]
…tests (#311)

* making SearchIndex.fields a list

* replace ids with md5 sum

* rename stub for test_agent_sharing_state

* run github actions on PRs into release branches too

* ruff formatting

* remove nested quotes for <3.12

* added staging data into tests; reimplemented tests to avoid downloads and keep consistent hashes; added env vars to test.yaml

* conftest removes extension from near match filenames

* add match_on for sequential runs since they may be out of order

* refresh cassettes

* replace deprecated pytest-vcr with pytest-recording

* change matching criteria

* set unpaywall email for CI

* try explicit conditions on cassette files, set the cassette dir

* removed vcr for sequential client tests

* add fixture to ensure that log levels are reset

* added cassettes back in for sequential tests after determining it was a log level failure

* limit cli logger mutation to avoid clobbering other library defaults

---------

Co-authored-by: Siddharth Narayanan <[email protected]>
Large refactor to decouple Docs, LLMs, and Config.

  • Moves to a centralized config that can be loaded from package data or defaults. This changes many function calls so that they pass config as an object or by the name of a config object.
  • Removed doc_match and doc_index - these were only sometimes useful and not really part of current usage; they just complicated things. We can add them back if needed.
  • Removed the unnecessarily complicated get_callbacks factories. Callbacks can now take a name kwarg to get access to the name of the chain being called.
  • Switched to contextvars for setting answer_id in LLMResults, so we no longer need so much back and forth for callbacks (a generic illustration of the pattern follows this list).
  • Deferred updates to Answer objects until the end of functions so that retrying is possible (except for token counts).
  • Generally moved all config out of Docs, except LLMModel, which will be fixed in the port to litellm.
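
As a generic illustration of the contextvars pattern mentioned above (not the library's actual code; the variable and function names below are hypothetical):

```python
import contextvars

# Hypothetical names: answer_id_var and record_llm_result are illustrative,
# not taken from the paperqa source.
answer_id_var: contextvars.ContextVar[str | None] = contextvars.ContextVar(
    "answer_id", default=None
)


def record_llm_result(text: str) -> dict:
    # A callback can read the current answer id without it being threaded
    # through every intermediate call.
    return {"answer_id": answer_id_var.get(), "text": text}


def run_query(answer_id: str) -> dict:
    token = answer_id_var.set(answer_id)  # bind the id for this call chain
    try:
        return record_llm_result("example completion")
    finally:
        answer_id_var.reset(token)


print(run_query("abc123"))
```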



Co-authored-by: Michael Skarlinski <[email protected]>
Co-authored-by: James Braza <[email protected]>
Co-authored-by: mskarlin <[email protected]>
I accidentally put an extra struct=True and broke everything
Co-authored-by: Michael Skarlinski <[email protected]>
Co-authored-by: James Braza <[email protected]>
Check if a journal name starts with `the`
@whitead merged commit 45b206d into main on Sep 11, 2024
2 of 3 checks passed