Commit 3cbc4dd

Usable "getting started" docs in README, both interpreter and Jupyter (#953)
1 parent 38b3fe5 commit 3cbc4dd

4 files changed

+61
-52
lines changed

README.md

Lines changed: 32 additions & 36 deletions
@@ -63,8 +63,10 @@ and finally answer the user question with an LLM agent.

 ```bash
 pip install paper-qa
+mkdir my_papers
+curl -o my_papers/PaperQA2.pdf https://arxiv.org/pdf/2409.13740
 cd my_papers
-pqa ask 'How can carbon nanotubes be manufactured at a large scale?'
+pqa ask 'What is PaperQA2?'
 ```

 ### Example Output
@@ -181,7 +183,7 @@ Those can be exported as `CROSSREF_API_KEY` and `SEMANTIC_SCHOLAR_API_KEY` varia
 The fastest way to test PaperQA2 is via the CLI. First navigate to a directory with some papers and use the `pqa` cli:

 ```bash
-$ pqa ask 'What manufacturing challenges are unique to bispecific antibodies?'
+pqa ask 'What is PaperQA2?'
 ```

 You will see PaperQA2 index your local PDF files, gathering the necessary metadata for each of them (using [Crossref](https://www.crossref.org/) and [Semantic Scholar](https://www.semanticscholar.org/)),
@@ -190,13 +192,13 @@ search over that index, then break the files into chunked evidence contexts, ran
 All prior answers will be indexed and stored, you can view them by querying via the `search` subcommand, or access them yourself in your `PQA_HOME` directory, which defaults to `~/.pqa/`.

 ```bash
-$ pqa search -i 'answers' 'antibodies'
+pqa -i 'answers' search 'ranking and contextual summarization'
 ```

 PaperQA2 is highly configurable, when running from the command line, `pqa --help` shows all options and short descriptions. For example to run with a higher temperature:

 ```bash
-$ pqa --temperature 0.5 ask 'What manufacturing challenges are unique to bispecific antibodies?'
+pqa --temperature 0.5 ask 'What is PaperQA2?'
 ```

 You can view all settings with `pqa view`. Another useful thing is to change to other templated settings - for example `fast` is a setting that answers more quickly and you can see it with `pqa -s fast view`
@@ -210,13 +212,13 @@ pqa -s my_new_settings --temperature 0.5 --llm foo-bar-5 save
 and then you can use it with

 ```bash
-pqa -s my_new_settings ask 'What manufacturing challenges are unique to bispecific antibodies?'
+pqa -s my_new_settings ask 'What is PaperQA2?'
 ```

 If you run `pqa` with a command which requires a new indexing, say if you change the default chunk_size, a new index will automatically be created for you.

 ```bash
-pqa --parsing.chunk_size 5000 ask 'What manufacturing challenges are unique to bispecific antibodies?'
+pqa --parsing.chunk_size 5000 ask 'What is PaperQA2?'
 ```

 You can also use `pqa` to do full-text search with use of LLMs view the search command. For example, let's save the index from a directory and give it a name:
@@ -262,7 +264,7 @@ If you are hitting rate limits, say with the OpenAI Tier 1 plan, you can add the
 For each OpenAI tier, a pre-built setting exists to limit usage.

 ```bash
-pqa --settings 'tier1_limits' ask 'Are there nm scale features in thermoelectric materials?'
+pqa --settings 'tier1_limits' ask 'What is PaperQA2?'
 ```

 This will limit your system to use the [tier1_limits](paperqa/configs/tier1_limits.json),
@@ -271,7 +273,7 @@ and slow down your queries to accommodate.
 You can also specify them manually with any rate limit string that matches the specification in the [limits](https://limits.readthedocs.io/en/stable/quickstart.html#rate-limit-string-notation) module:

 ```bash
-pqa --summary_llm_config '{"rate_limit": {"gpt-4o-2024-11-20": "30000 per 1 minute"}}' ask 'Are there nm scale features in thermoelectric materials?'
+pqa --summary_llm_config '{"rate_limit": {"gpt-4o-2024-11-20": "30000 per 1 minute"}}' ask 'What is PaperQA2?'
 ```

 Or by adding into a `Settings` object, if calling imperatively:
@@ -280,7 +282,7 @@ Or by adding into a `Settings` object, if calling imperatively:
 from paperqa import Settings, ask

 answer_response = ask(
-    "What manufacturing challenges are unique to bispecific antibodies?",
+    "What is PaperQA2?",
     settings=Settings(
         llm_config={"rate_limit": {"gpt-4o-2024-11-20": "30000 per 1 minute"}},
         summary_llm_config={"rate_limit": {"gpt-4o-2024-11-20": "30000 per 1 minute"}},
@@ -296,7 +298,7 @@ PaperQA2's full workflow can be accessed via Python directly:
 from paperqa import Settings, ask

 answer_response = ask(
-    "What manufacturing challenges are unique to bispecific antibodies?",
+    "What is PaperQA2?",
     settings=Settings(temperature=0.5, paper_directory="my_papers"),
 )
 ```
@@ -312,7 +314,7 @@ The answer object has the following attributes: `formatted_answer`, `answer` (an
 from paperqa import Settings, ask

 answer_response = ask(
-    "What manufacturing challenges are unique to bispecific antibodies?",
+    "What is PaperQA2?",
     settings=Settings(temperature=0.5, paper_directory="my_papers"),
 )
 ```
@@ -323,7 +325,7 @@ answer_response = ask(
 from paperqa import Settings, agent_query

 answer_response = await agent_query(
-    query="What manufacturing challenges are unique to bispecific antibodies?",
+    query="What is PaperQA2?",
     settings=Settings(temperature=0.5, paper_directory="my_papers"),
 )
 ```
@@ -358,10 +360,7 @@ settings.llm = "claude-3-5-sonnet-20240620"
 settings.answer.answer_max_sources = 3

 # Query the Docs object to get an answer
-session = await docs.aquery(
-    "What manufacturing challenges are unique to bispecific antibodies?",
-    settings=settings,
-)
+session = await docs.aquery("What is PaperQA2?", settings=settings)
 print(session)
 ```

@@ -394,9 +393,7 @@ async def main() -> None:
     for doc in ("myfile.pdf", "myotherfile.pdf"):
         await docs.aadd(doc)

-    session = await docs.aquery(
-        "What manufacturing challenges are unique to bispecific antibodies?"
-    )
+    session = await docs.aquery("What is PaperQA2?")
     print(session)


@@ -421,7 +418,7 @@ You can adjust this easily to use any model supported by `litellm`:
 from paperqa import Settings, ask

 answer_response = ask(
-    "What manufacturing challenges are unique to bispecific antibodies?",
+    "What is PaperQA2?",
     settings=Settings(
         llm="gpt-4o-mini", summary_llm="gpt-4o-mini", paper_directory="my_papers"
     ),
@@ -437,7 +434,7 @@ from paperqa import Settings, ask
 from paperqa.settings import AgentSettings

 answer_response = ask(
-    "What manufacturing challenges are unique to bispecific antibodies?",
+    "What is PaperQA2?",
     settings=Settings(
         llm="claude-3-5-sonnet-20240620",
         summary_llm="claude-3-5-sonnet-20240620",
@@ -455,7 +452,7 @@ from paperqa import Settings, ask
 from paperqa.settings import AgentSettings

 answer_response = ask(
-    "What manufacturing challenges are unique to bispecific antibodies?",
+    "What is PaperQA2?",
     settings=Settings(
         llm="gemini/gemini-2.0-flash",
         summary_llm="gemini/gemini-2.0-flash",
@@ -491,7 +488,7 @@ local_llm_config = dict(
 )

 answer_response = ask(
-    "What manufacturing challenges are unique to bispecific antibodies?",
+    "What is PaperQA2?",
     settings=Settings(
         llm="my-llm-model",
         llm_config=local_llm_config,
@@ -520,7 +517,7 @@ local_llm_config = {
 }

 answer_response = ask(
-    "What manufacturing challenges are unique to bispecific antibodies?",
+    "What is PaperQA2?",
     settings=Settings(
         llm="ollama/llama3.2",
         llm_config=local_llm_config,
@@ -549,7 +546,7 @@ The simplest way to specify the embedding model is via `Settings.embedding`:
 from paperqa import Settings, ask

 answer_response = ask(
-    "What manufacturing challenges are unique to bispecific antibodies?",
+    "What is PaperQA2?",
     settings=Settings(embedding="text-embedding-3-large"),
 )
 ```
@@ -606,7 +603,7 @@ and then prefix embedding model names with `st-`:
 from paperqa import Settings, ask

 answer_response = ask(
-    "What manufacturing challenges are unique to bispecific antibodies?",
+    "What is PaperQA2?",
     settings=Settings(embedding="st-multi-qa-MiniLM-L6-cos-v1"),
 )
 ```
@@ -617,7 +614,7 @@ or with a hybrid model
 from paperqa import Settings, ask

 answer_response = ask(
-    "What manufacturing challenges are unique to bispecific antibodies?",
+    "What is PaperQA2?",
     settings=Settings(embedding="hybrid-st-multi-qa-MiniLM-L6-cos-v1"),
 )
 ```
@@ -634,7 +631,7 @@ settings.answer.answer_max_sources = 3
 settings.answer.k = 5

 await docs.aquery(
-    "What manufacturing challenges are unique to bispecific antibodies?",
+    "What is PaperQA2?",
     settings=settings,
 )
 ```
@@ -653,7 +650,9 @@ source_files = glob.glob("**/*.js")
 docs = Docs()
 for f in source_files:
     # this assumes the file names are unique in code
-    await docs.aadd(f, citation="File " + os.path.basename(f), docname=os.path.basename(f))
+    await docs.aadd(
+        f, citation="File " + os.path.basename(f), docname=os.path.basename(f)
+    )
 session = await docs.aquery("Where is the search bar in the header defined?")
 print(session)
 ```
@@ -726,11 +725,11 @@ async def amain(folder_of_papers: str | os.PathLike) -> None:

     # 2. Use the settings as many times as you want with ask
     answer_response_1 = await agent_query(
-        query="What is the best way to make a vaccine?",
+        query="What is a cool retrieval augmented generation technique?",
         settings=settings,
     )
     answer_response_2 = await agent_query(
-        query="What manufacturing challenges are unique to bispecific antibodies?",
+        query="What is PaperQA2?",
         settings=settings,
     )
 ```
@@ -863,10 +862,7 @@ docs = Docs()

 # add some docs...

-await docs.aquery(
-    "What manufacturing challenges are unique to bispecific antibodies?",
-    callbacks=[typewriter],
-)
+await docs.aquery("What is PaperQA2?", callbacks=[typewriter])
 ```

 ### Caching Embeddings
@@ -892,7 +888,7 @@ my_qa_prompt = (
 docs = Docs()
 settings = Settings()
 settings.prompts.qa = my_qa_prompt
-await docs.aquery("Are covid-19 vaccines effective?", settings=settings)
+await docs.aquery("What is PaperQA2?", settings=settings)
 ```

 ### Pre and Post Prompts

paperqa/agents/__init__.py

Lines changed: 15 additions & 9 deletions
@@ -1,6 +1,7 @@
 from __future__ import annotations

 import argparse
+import asyncio
 import logging
 import os
 from pathlib import Path
@@ -11,7 +12,7 @@
 from rich.logging import RichHandler

 from paperqa.settings import Settings, get_settings
-from paperqa.utils import get_loop, pqa_directory, setup_default_logs
+from paperqa.utils import pqa_directory, run_or_ensure, setup_default_logs
 from paperqa.version import __version__

 from .main import agent_query, index_search
@@ -98,25 +99,30 @@ def configure_cli_logging(verbosity: int | Settings = 0) -> None:
         print(f"PaperQA version: {__version__}")


-def ask(query: str | MultipleChoiceQuestion, settings: Settings) -> AnswerResponse:
+def ask(
+    query: str | MultipleChoiceQuestion, settings: Settings
+) -> AnswerResponse | asyncio.Task[AnswerResponse]:
     """Query PaperQA via an agent."""
     configure_cli_logging(settings)
-    return get_loop().run_until_complete(
-        agent_query(query, settings, agent_type=settings.agent.agent_type)
+    return run_or_ensure(
+        coro=agent_query(query, settings, agent_type=settings.agent.agent_type)
     )


 def search_query(
     query: str | MultipleChoiceQuestion,
     index_name: str,
     settings: Settings,
-) -> list[tuple[AnswerResponse, str] | tuple[Any, str]]:
+) -> (
+    list[tuple[AnswerResponse, str] | tuple[Any, str]]
+    | asyncio.Task[list[tuple[AnswerResponse, str] | tuple[Any, str]]]
+):
     """Search using a pre-built PaperQA index."""
     configure_cli_logging(settings)
     if index_name == "default":
         index_name = settings.get_index_name()
-    return get_loop().run_until_complete(
-        index_search(
+    return run_or_ensure(
+        coro=index_search(
             query if isinstance(query, str) else query.question_prompt,
             index_name=index_name,
             index_directory=settings.agent.index.index_directory,
@@ -128,7 +134,7 @@ def build_index(
     index_name: str | None = None,
     directory: str | os.PathLike | None = None,
     settings: Settings | None = None,
-) -> SearchIndex:
+) -> SearchIndex | asyncio.Task[SearchIndex]:
     """Build a PaperQA search index, this will also happen automatically upon using `ask`."""
     settings = get_settings(settings)
     if index_name == "default":
@@ -138,7 +144,7 @@ def build_index(
     configure_cli_logging(settings)
     if directory:
         settings.agent.index.paper_directory = directory
-    return get_loop().run_until_complete(get_directory_index(settings=settings))
+    return run_or_ensure(coro=get_directory_index(settings=settings))


 def save_settings(settings: Settings, settings_path: str | os.PathLike) -> None:
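
With this change, `ask`, `search_query`, and `build_index` return either a finished result or an `asyncio.Task`, depending on whether an event loop is already running. Callers that must work in both settings could normalize the two cases with a small helper. The sketch below is illustrative only: `resolve`, `fake_query`, and `demo` are hypothetical names that are not part of this commit, and it does not import `paperqa`.

```python
from __future__ import annotations

import asyncio
import inspect
from collections.abc import Awaitable
from typing import TypeVar

T = TypeVar("T")


async def resolve(result: T | Awaitable[T]) -> T:
    """Return `result` as-is, awaiting it first if it is a Task or other awaitable."""
    if inspect.isawaitable(result):
        return await result
    return result


async def fake_query() -> str:  # hypothetical stand-in for an agent query
    await asyncio.sleep(0)
    return "answer"


async def demo() -> tuple[str, str]:
    already_done = "answer"  # shape of the result from a synchronous call
    scheduled = asyncio.ensure_future(fake_query())  # shape under a running loop
    return await resolve(already_done), await resolve(scheduled)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In a Jupyter cell, where a loop is already running, the equivalent is simply to `await` the value returned by `ask`.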

paperqa/utils.py

Lines changed: 9 additions & 1 deletion
@@ -10,7 +10,7 @@
 import re
 import string
 import unicodedata
-from collections.abc import Collection, Iterable, Iterator
+from collections.abc import Awaitable, Collection, Iterable, Iterator
 from datetime import datetime
 from functools import reduce
 from http import HTTPStatus
@@ -215,6 +215,14 @@ def get_loop() -> asyncio.AbstractEventLoop:
     return loop


+def run_or_ensure(coro: Awaitable[T]) -> T | asyncio.Task[T]:
+    """Run a coroutine or convert to a future if an event loop is running."""
+    loop = get_loop()
+    if loop.is_running():  # In async contexts (e.g., Jupyter notebook), return a Task
+        return asyncio.ensure_future(coro)
+    return loop.run_until_complete(coro)
+
+
 def encode_id(value: str | bytes | UUID, maxsize: int | None = 16) -> str:
     """Encode a value (e.g. a DOI) optionally with a max length."""
     if isinstance(value, UUID):
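
The `run_or_ensure` helper added in this file dispatches on whether an event loop is already running. A minimal standalone sketch of the same idea follows, assuming Python 3.10+. It is not the repository's implementation: it swaps `get_loop()` for `asyncio.get_running_loop()` plus `asyncio.run()`, and `double` and `in_async_context` are hypothetical names used only for the demonstration.

```python
from __future__ import annotations

import asyncio
from collections.abc import Coroutine
from typing import Any, TypeVar

T = TypeVar("T")


def run_or_ensure(coro: Coroutine[Any, Any, T]) -> T | asyncio.Task[T]:
    """Run `coro` to completion, or schedule it if a loop is already running."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No running loop (plain interpreter or script): block until done.
        return asyncio.run(coro)
    # Running loop (e.g. a Jupyter cell): return a Task the caller can await.
    return asyncio.ensure_future(coro)


async def double(x: int) -> int:  # hypothetical coroutine for the demo
    await asyncio.sleep(0)
    return 2 * x


async def in_async_context() -> int:
    task = run_or_ensure(double(4))
    assert isinstance(task, asyncio.Task)
    return await task


if __name__ == "__main__":
    print(run_or_ensure(double(21)))  # synchronous caller gets the value
    print(asyncio.run(in_async_context()))  # async caller awaits the Task
```

The trade-off is a union return type: synchronous callers get the result directly, while callers under a running loop must `await` the returned Task.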

tests/test_cli.py

Lines changed: 5 additions & 6 deletions
@@ -7,15 +7,11 @@
 from tenacity import Retrying, retry_if_exception_type, stop_after_attempt

 from paperqa import Docs
+from paperqa.agents import ask, build_index, main, search_query
+from paperqa.agents.models import AnswerResponse
 from paperqa.settings import Settings
 from paperqa.utils import pqa_directory

-try:
-    from paperqa.agents import ask, build_index, main, search_query
-    from paperqa.agents.models import AnswerResponse
-except ImportError:
-    pytest.skip("agents module is not installed", allow_module_level=True)
-

 def test_can_modify_settings(capsys, stub_data_dir: Path) -> None:
     rel_path_home_to_stub_data = Path("~") / stub_data_dir.relative_to(Path.home())
@@ -58,13 +54,15 @@ def test_cli_ask(agent_index_dir: Path, stub_data_dir: Path) -> None:
     response = ask(
         "How can you use XAI for chemical property prediction?", settings=settings
     )
+    assert isinstance(response, AnswerResponse)
     assert response.session.formatted_answer

     search_result = search_query(
         " ".join(response.session.formatted_answer.split()),
         "answers",
         settings,
     )
+    assert isinstance(search_result, list)
     found_answer = search_result[0][0]
     assert isinstance(found_answer, AnswerResponse)
     assert found_answer.model_dump() == response.model_dump()
@@ -86,6 +84,7 @@ def test_cli_can_build_and_search_index(
         with attempt:
             build_index(index_name, stub_data_dir, settings)
             result = search_query("XAI", index_name, settings)
+            assert isinstance(result, list)
             assert len(result) == 1
             assert isinstance(result[0][0], Docs)
             assert all(d.startswith("Wellawatte") for d in result[0][0].docnames)
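
The `isinstance` assertions added to these tests narrow the new union return types back to the synchronous result. If that pattern recurs, it could be factored into a helper that fails with a clearer message when a Task comes back instead. The sketch below is one possible shape; `expect_sync` is a hypothetical name that does not exist in the repository.

```python
from __future__ import annotations

import asyncio
from typing import TypeVar

T = TypeVar("T")


def expect_sync(value: object, expected: type[T]) -> T:
    """Narrow a `T | asyncio.Task[T]` value to `T`, failing loudly otherwise."""
    if isinstance(value, asyncio.Task):
        raise TypeError("got an asyncio.Task; await it or call from synchronous code")
    if not isinstance(value, expected):
        raise TypeError(f"expected {expected.__name__}, got {type(value).__name__}")
    return value


if __name__ == "__main__":
    results = expect_sync(["hit"], list)  # passes and is typed as a list
    print(len(results))
```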
