Documenting and cleaning up manifest file logic #448

Merged
merged 3 commits into from
Sep 21, 2024
36 changes: 32 additions & 4 deletions README.md
@@ -18,8 +18,8 @@ question answering, summarization, and contradiction detection.
- [What's New in Version 5 (aka PaperQA2)?](#whats-new-in-version-5-aka-paperqa2)
- [PaperQA2 Algorithm](#paperqa2-algorithm)
- [Installation](#installation)
-  - [CLI Usage](#cli-usage)
-    - [Bundled Settings](#bundled-settings)
+- [CLI Usage](#cli-usage)
+  - [Bundled Settings](#bundled-settings)
- [Library Usage](#library-usage)
- [`ask` manually](#ask-manually)
- [Adding Documents Manually](#adding-documents-manually)
@@ -30,6 +30,8 @@ question answering, summarization, and contradiction detection.
- [Adjusting number of sources](#adjusting-number-of-sources)
- [Using Code or HTML](#using-code-or-html)
- [Using External DB/Vector DB and Caching](#using-external-dbvector-db-and-caching)
+- [Creating Index](#creating-index)
+- [Manifest Files](#manifest-files)
- [Reusing Index](#reusing-index)
- [Running on LitQA v2](#running-on-litqa-v2)
- [Using Clients Directly](#using-clients-directly)
@@ -169,7 +171,7 @@ you will likely want an API key for both [Crossref](https://www.crossref.org/doc
which will allow you to avoid hitting public rate limits using these metadata services.
Those can be exported as `CROSSREF_API_KEY` and `SEMANTIC_SCHOLAR_API_KEY` variables.

-### CLI Usage
+## CLI Usage

The fastest way to test PaperQA2 is via the CLI. First navigate to a directory with some papers and use the `pqa` cli:

@@ -236,7 +238,7 @@ Both the CLI and module have pre-configured settings based on prior performance
pqa --settings <setting name> ask 'Are there nm scale features in thermoelectric materials?'
```

-#### Bundled Settings
+### Bundled Settings

Inside [`paperqa/configs`](paperqa/configs) we bundle known useful settings:

@@ -524,6 +526,32 @@ for ... in my_docs:
    docs.add_texts(texts, doc)
```

+### Creating Index
+
+Indexes will be placed in the [home directory][home dir] by default.
+This can be controlled via the `PQA_HOME` environment variable.
+
+Indexes are made by reading files in the `Settings.paper_directory`.
+By default, we recursively read from subdirectories of the paper directory,
+unless disabled using `Settings.index_recursively`.
+The paper directory is not modified in any way, it's just read from.
+
+[home dir]: https://docs.python.org/3/library/pathlib.html#pathlib.Path.home
+
+#### Manifest Files
+
+The indexing process attempts to infer paper metadata like title and DOI
+using LLM-powered text processing.
+You can avoid this point of uncertainty using a "manifest" file,
+which is a CSV containing three columns (order doesn't matter):
+
+- `file_location`: relative path to the paper's PDF within the index directory
+- `doi`: DOI of the paper
+- `title`: title of the paper
+
+By providing this information,
+we ensure queries to metadata providers like Crossref are accurate.
+
### Reusing Index

The local search indexes are built based on a hash of the current `Settings` object.
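
The manifest format described in the Manifest Files section of this diff can be illustrated with a small sketch. It writes a CSV with the three documented columns using only the Python standard library. The helper name `write_manifest`, the constant `MANIFEST_COLUMNS`, and every path, DOI, and title in `example_rows` are hypothetical placeholders, not part of PaperQA2.

```python
import csv
from pathlib import Path

# The three columns named in the README text; their order in the file does not matter.
MANIFEST_COLUMNS = ["file_location", "doi", "title"]


def write_manifest(manifest_path: Path, rows: list) -> None:
    """Write a manifest CSV: one header row, then one row per paper."""
    with manifest_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=MANIFEST_COLUMNS)
        writer.writeheader()
        writer.writerows(rows)


# Placeholder entries: these paths, DOIs, and titles are invented for illustration.
example_rows = [
    {
        "file_location": "smith2020.pdf",
        "doi": "10.1000/example.1",
        "title": "An Example Paper",
    },
    {
        "file_location": "subdir/lee2021.pdf",
        "doi": "10.1000/example.2",
        "title": "Another Example Paper",
    },
]
```

Calling `write_manifest(Path("manifest.csv"), example_rows)` produces a header row followed by one row per paper. How the file is then handed to PaperQA2 depends on the manifest setting touched in `paperqa/settings.py`, whose field name is not shown in this diff.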
9 changes: 5 additions & 4 deletions paperqa/agents/search.py
@@ -333,16 +333,17 @@ async def query(
]


-async def maybe_get_manifest(filename: anyio.Path | None) -> dict[str, DocDetails]:
+async def maybe_get_manifest(
+    filename: anyio.Path | None = None,
+) -> dict[str, DocDetails]:
     if not filename:
         return {}
     if filename.suffix == ".csv":
         try:
             async with await anyio.open_file(filename, mode="r") as file:
                 content = await file.read()
-                reader = csv.DictReader(StringIO(content))
-                records = [DocDetails(**row) for row in reader]
-                return {str(r.file_location): r for r in records if r.file_location}
+            records = [DocDetails(**row) for row in csv.DictReader(StringIO(content))]
+            return {str(r.file_location): r for r in records if r.file_location}
         except FileNotFoundError:
             logging.warning(f"Manifest file at {filename} could not be found.")
         except Exception:
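
For readers who want to try the parsing flow without `anyio` or PaperQA2 installed, here is a synchronous sketch of the same idea. It is an approximation of the hunk above, not the implementation in `search.py`. `ManifestEntry` is a hypothetical stand-in for `DocDetails`, `load_manifest` is an illustrative name, and returning an empty dict after a missing-file warning is an assumption, since the rest of the function is outside this hunk.

```python
import csv
import logging
from dataclasses import dataclass
from io import StringIO
from pathlib import Path
from typing import Dict, Optional


@dataclass
class ManifestEntry:
    """Hypothetical stand-in for DocDetails, holding only the three manifest columns."""

    file_location: str = ""
    doi: str = ""
    title: str = ""


def load_manifest(filename: Optional[Path] = None) -> Dict[str, ManifestEntry]:
    """Map each non-empty file_location to its manifest row; return {} otherwise."""
    if not filename:
        return {}
    if filename.suffix == ".csv":
        try:
            content = filename.read_text(encoding="utf-8")
            records = [
                ManifestEntry(
                    # Missing or empty cells fall back to an empty string.
                    file_location=row.get("file_location") or "",
                    doi=row.get("doi") or "",
                    title=row.get("title") or "",
                )
                for row in csv.DictReader(StringIO(content))
            ]
            # Rows without a file_location cannot be matched to a paper, so skip them.
            return {r.file_location: r for r in records if r.file_location}
        except FileNotFoundError:
            logging.warning("Manifest file at %s could not be found.", filename)
    # Assumption: anything that is not a readable CSV yields an empty manifest.
    return {}
```

Because `csv.DictReader` keys each row by the header names, the column order in the file does not affect the result, which matches the README's note that order doesn't matter.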
2 changes: 1 addition & 1 deletion paperqa/settings.py
@@ -432,7 +432,7 @@ class Settings(BaseSettings):
         default=None,
         description=(
             "Optional manifest CSV, containing columns which are attributes for a"
-            " DocDetails object. Only 'file_location','doi', and 'title' will be used"
+            " DocDetails object. Only 'file_location', 'doi', and 'title' will be used"
             " when indexing."
         ),
     )