
September 2024 release #362

Merged: 55 commits into main on Sep 11, 2024
Conversation

@whitead (Collaborator) commented Sep 11, 2024

This comment is a draft of the release notes below (this will be a merge commit)

New Features

  • Automatic metadata population: PDF metadata is automatically retrieved from a variety of providers, adding BibTeX entries, citation counts, journal quality assessments, and retraction notices
  • Full-text search: a major difference between our published work and this repo has been the ability to search over all of the scientific literature. We've brought the OSS version closer by adding full-text keyword search via tantivy. You can now index and keyword-search many papers before embedding, making it feasible to ingest large corpora
  • Unified settings management: you can now save and load settings, which makes it easier for us to distribute settings tailored to various PaperQA2 tasks, such as writing Wikipedia articles, identifying contradictions, and extracting structured data
  • CLI: we've added a CLI that uses persistent parsings/indexes, making it much easier to ask questions of a folder of PDFs
  • LiteLLM: we've adopted LiteLLM as the LLM wrapper of choice, so many LLM APIs are now supported directly by changing only the model string. It also gives us "routers" that can handle fallbacks, API rate limiting, and retries (see the sketch after this list)
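
To give a flavor of how the new settings and LiteLLM model strings fit together, here is a minimal sketch assuming the `Settings`/`ask` interface shipped with this release; the question, model string, and paper directory are illustrative placeholders, not canonical values:

```python
from paperqa import Settings, ask

# Illustrative only: any LiteLLM-supported model string should work here, and
# paper_directory should point at a local folder of PDFs to be indexed.
answer = ask(
    "What manufacturing challenges are unique to bioplastics?",
    settings=Settings(llm="gpt-4o-mini", paper_directory="my_papers"),
)
```

The same `Settings` object can be serialized and reloaded, which is what lets us distribute task-specific settings bundles.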

Improvements

  • More modern agent frameworks
  • Reduction in dependencies
  • Removed code duplicated by litellm
  • Many improvements on code style and best practices

Regressions/Deprecation

We've removed the following features to keep our library focused:

  • doc_match - we do not have enough data to show that this method actually helps for very large corpora
  • LangchainVectorStore - we no longer support more complex vector stores (such as FAISS) via LangChain; only NumPy vector stores remain. We never found the very-large-vector-store paradigm to outperform keyword search -> vector search -> LLM reranking, so we removed the code (a sketch of the remaining path follows this list)
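
For reference, a hedged sketch of the retrieval path that remains after this removal, assuming the `Docs` class keeps its `add`/`query` methods and defaults to the built-in NumPy vector store; the file path is a placeholder:

```python
from paperqa import Docs

# Assumed behavior: Docs defaults to the in-memory NumPy vector store, so no
# external vector database is needed. The PDF path below is a placeholder.
docs = Docs()
docs.add("my_papers/example.pdf")
answer = docs.query("What did this paper conclude?")
```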

Detailed Changes:

New Contributors

Full Changelog: v4.9.0...vnew

mskarlin and others added 30 commits on August 29, 2024
* add agentic workflow module and cli

* update pyproject.toml and split google into its own import

* remove unused import

* update default search chain

* Update __init__.py

* Update search.py

* move reqs into pyproject.toml, addressing PR comments

* rename search_chain to openai_get_search_query

* remove get_llm_name

* move table_formatter to helpers.py

* Update paperqa/agents/main.py

Co-authored-by: James Braza <[email protected]>

* Update paperqa/agents/main.py

Co-authored-by: James Braza <[email protected]>

* Update paperqa/agents/main.py

Co-authored-by: James Braza <[email protected]>

* Update paperqa/agents/docs.py

Co-authored-by: James Braza <[email protected]>

* Update paperqa/types.py

Co-authored-by: James Braza <[email protected]>

* rename compute_cost to compute_total_model_token_cost

* remove stream_answer

* rename to stub_manifest, and use Path for all paths

* Update paperqa/llms.py

Co-authored-by: James Braza <[email protected]>

* move SKIP_AGENT_TESTS = False

* nix _ = assignments

* add test comments

* types in conftest.py

* split libs into llms

* link openai chat timeout to query.timeout

* Update paperqa/agents/__init__.py

Co-authored-by: James Braza <[email protected]>

* logging revamp and renaming

* Update tests/test_cli.py

Co-authored-by: James Braza <[email protected]>

* Update tests/test_cli.py

Co-authored-by: James Braza <[email protected]>

* move vertex import to func call, add docstring to SupportsPickle

* docstring

* remove _ =

* remove bool return type from set

* update gitignore

* add config attribute to base LLMModel class

* replace get_current_settings -> get_settings

* replace get_current_settings -> get_settings

* PR simplifications

* remove all stream_* functions

* avoid modifying the root logger

* re-organize logger import location

* move hashlib into utils

* refactor strip_answer into Answer object

* label circular imports

* ensure absolute paths are used in index name

* limit select to be used only when DOI is not present in crossref

* Update paperqa/agents/search.py

Co-authored-by: James Braza <[email protected]>

* Update paperqa/agents/search.py

Co-authored-by: James Braza <[email protected]>

* Update paperqa/agents/search.py

Co-authored-by: James Braza <[email protected]>

* Update paperqa/agents/search.py

Co-authored-by: James Braza <[email protected]>

* Update paperqa/agents/models.py

Co-authored-by: James Braza <[email protected]>

* reconfigure logging to not prevent propagation

* remove newlines in the current year

* use required fields as a subset

* replace . with Path.cwd()

---------

Co-authored-by: James Braza <[email protected]>
* add unpaywall provider

* remove unused clean query method, update test cassettes to use [email protected]
…tests (#311)

* making SearchIndex.fields a list

* replace ids with md5 sum

* rename stub for test_agent_sharing_state

* run github actions on PRs into release branches too

* ruff formatting

* remove nested quotes for <3.12

* added staging data into tests; reimplemented tests to avoid downloads and keep consistent hashes; added env vars to test.yaml

* conftest removes extension from near match filenames

* add match_on for sequential runs since they may be out of order

* refresh cassettes

* replace deprecated pytest-vcr with pytest-recording

* change matching criteria

* set unpaywall email for CI

* try explicit conditions on cassette files, set the cassette dir

* removed vcr for sequential client tests

* add fixture to ensure that log levels are reset

* added cassettes back in for sequential tests after determining it was a log level failure

* limit cli logger mutation to avoid clobbering other library defaults

---------

Co-authored-by: Siddharth Narayanan <[email protected]>
Large refactor to decouple Docs, LLMs, and Config.

  • Moves to a centralized config that can be loaded from package data or defaults. This changes many function calls so that they pass config as an object or by the name of a config object.
  • Removed doc_match and doc_index - these were only sometimes useful and not really part of current usage; they just complicated things. We can add them back if needed.
  • Removed the unnecessarily complicated get_callbacks factories. Callbacks can now take a name kwarg to get access to the name of the chain being called.
  • Switched to contextvars for setting answer_id in LLMResults, so we no longer need so much back and forth for callbacks (a generic illustration of the pattern follows this list).
  • Deferred updates to Answer objects until the end of functions so that retrying is possible (except for token counts).
  • Generally moved all config out of Docs, except LLMModel, which will be fixed in the port to litellm.
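
As a generic illustration of the contextvars pattern mentioned above (not the library's actual code; the variable and function names below are hypothetical):

```python
import contextvars

# Hypothetical names: answer_id_var and record_llm_result are illustrative,
# not taken from the paperqa source.
answer_id_var: contextvars.ContextVar[str | None] = contextvars.ContextVar(
    "answer_id", default=None
)


def record_llm_result(text: str) -> dict:
    # A callback can read the current answer id without it being threaded
    # through every intermediate call.
    return {"answer_id": answer_id_var.get(), "text": text}


def run_query(answer_id: str) -> dict:
    token = answer_id_var.set(answer_id)  # bind the id for this call chain
    try:
        return record_llm_result("example completion")
    finally:
        answer_id_var.reset(token)


print(run_query("abc123"))
```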



Co-authored-by: Michael Skarlinski <[email protected]>
Co-authored-by: James Braza <[email protected]>
Co-authored-by: mskarlin <[email protected]>
I accidentally put an extra struct=True and broke everything
Co-authored-by: Michael Skarlinski <[email protected]>
Co-authored-by: James Braza <[email protected]>
Check if a journal name starts with `the`
@whitead merged commit 45b206d into main on Sep 11, 2024
2 of 3 checks passed