Feature/new doc types #1169
Conversation
Documentation Updates: 1 document was updated by changes in this PR: paper-qa
This PR is being reviewed by Cursor Bugbot
Bug: Office Docs Missing Chunking Logic, Failing Processing
Missing chunking logic for office documents (.docx, .xlsx, .pptx). The parse_office_doc function (lines 176-220) creates ParsedText with content as a dict (line 209), but there's no corresponding chunking case in read_doc. Office documents will fall through to the else clause (line 499) which uses chunk_code_text. However, chunk_code_text expects content to be str|list (lines 287-290) and will raise NotImplementedError when given a dict, causing office document processing to fail during chunking.
src/paperqa/readers.py#L454-L505
paper-qa/src/paperqa/readers.py
Lines 454 to 505 in 95e6e25
if chunk_chars == 0:
    chunked_text = [
        Text(text=parsed_text.reduce_content(), name=doc.docname, doc=doc)
    ]
    chunk_metadata = ChunkMetadata(
        size=0,
        overlap=0,
        name=(
            f"paper-qa={pqa_version}|algorithm=none"
            f"|reduction=cl100k_base{enrichment_summary}"
        ),
    )
elif str_path.endswith(".pdf"):
    chunked_text = chunk_pdf(
        parsed_text, doc, chunk_chars=chunk_chars, overlap=overlap
    )
    chunk_metadata = ChunkMetadata(
        size=chunk_chars,
        overlap=overlap,
        name=(
            f"paper-qa={pqa_version}|algorithm=overlap-pdf"
            f"|size={chunk_chars}|overlap={overlap}{enrichment_summary}"
        ),
    )
elif str_path.endswith(IMAGE_EXTENSIONS):
    chunked_text = chunk_pdf(
        parsed_text, doc, chunk_chars=chunk_chars, overlap=overlap
    )
    chunk_metadata = ChunkMetadata(
        size=0,
        overlap=0,
        name=f"paper-qa={pqa_version}|algorithm=none{enrichment_summary}",
    )
elif str_path.endswith((".txt", ".html")):
    chunked_text = chunk_text(
        parsed_text, doc, chunk_chars=chunk_chars, overlap=overlap
    )
    chunk_metadata = ChunkMetadata(
        size=chunk_chars,
        overlap=overlap,
        name=(
            f"paper-qa={pqa_version}|algorithm=overlap-text|reduction=cl100k_base"
            f"|size={chunk_chars}|overlap={overlap}{enrichment_summary}"
        ),
    )
else:
    chunked_text = chunk_code_text(
        parsed_text, doc, chunk_chars=chunk_chars, overlap=overlap
    )
    chunk_metadata = ChunkMetadata(
        size=chunk_chars,
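One way to close the gap Bugbot describes is to flatten the dict-shaped office content into a single string before handing it to a text chunker. This is a hedged sketch, not paper-qa's actual fix; the helper name flatten_office_content is hypothetical, and it only assumes the dict shape (section name to text) described in the bug report above:

```python
def flatten_office_content(content) -> str:
    """Join dict-valued parsed content into one chunkable string.

    Office parsers may return a mapping (e.g. sheet or slide name -> text),
    while text chunkers expect a plain string, so flatten deterministically
    by key before chunking instead of falling through to chunk_code_text.
    """
    if isinstance(content, str):
        return content
    if isinstance(content, dict):
        return "\n\n".join(f"{key}\n{value}" for key, value in sorted(content.items()))
    raise NotImplementedError(f"Unsupported content type: {type(content).__name__}")
```

With a helper like this, an office-document branch could reuse the existing text-chunking path rather than raising NotImplementedError.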
src/paperqa/readers.py
Outdated
    media_index += 1
elif isinstance(el, Table):
    # For tables, we could get the HTML representation for better structure
    current_text += el.metadata.text_as_html + "\n\n"
Bug: Null Text Concatenation Crash in HTML Rendering
Potential TypeError when el.metadata.text_as_html is None. The code concatenates el.metadata.text_as_html with a string without checking if it's None first. If the unstructured library returns None for text_as_html on certain Table elements, this will raise a TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'.
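A defensive fix for this could fall back to an empty string. This is a hedged sketch with a hypothetical helper name, not the code actually merged in this PR:

```python
def append_table_html(current_text: str, text_as_html) -> str:
    """Append a table's HTML representation, tolerating a missing value.

    Some Table elements may have text_as_html unset (None); coalescing to an
    empty string avoids `TypeError: unsupported operand type(s) for +`.
    """
    return current_text + (text_as_html or "") + "\n\n"
```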
src/paperqa/readers.py
Outdated
from unstructured.documents.elements import Image, Table
from unstructured.partition.auto import partition
Take a look around the repo for other examples of lazy imports; let's just lazily import these in the first line of parse_office_doc:
try:
from unstructured.documents.elements import Image, Table
from unstructured.partition.auto import partition
except ImportError as exc:
raise ImportError(
"TODO some message mentioning to install `paper-qa[office]`"
) from exc
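The lazy-import pattern the reviewer is asking for can be generalized into a small helper. A hedged sketch under that assumption (require_extra is a hypothetical name, not paper-qa's API; the extra name `office` comes from this PR):

```python
import importlib


def require_extra(module_name: str, extra: str):
    """Lazily import module_name, pointing users at a pip extra on failure.

    Deferring the import keeps `import paperqa` fast and lets the heavy
    dependency stay optional until office parsing is actually requested.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"Parsing office documents requires {module_name!r}."
            f" Install it via `pip install paper-qa[{extra}]`."
        ) from exc
```

Usage inside parse_office_doc would then be a one-liner per dependency, e.g. `partition_mod = require_extra("unstructured.partition.auto", "office")`.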
src/paperqa/readers.py
Outdated
return ParsedText(
    content=content_dict,
    metadata=ParsedMetadata(
        parsing_libraries=["unstructured"],
Can you include the version of unstructured here? Take a look at how the Docling reader does it (at packages/paper-qa-docling)
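Capturing an installed package's version for metadata like this is typically done via importlib.metadata. A hedged sketch (helper name hypothetical; the Docling reader referenced above may do it differently):

```python
from importlib.metadata import PackageNotFoundError, version


def library_with_version(name: str) -> str:
    """Return 'name (x.y.z)' for an installed distribution, else just name.

    Useful for parsing_libraries metadata entries so parsed documents record
    which parser version produced them.
    """
    try:
        return f"{name} ({version(name)})"
    except PackageNotFoundError:
        return name
```

The metadata line would then read something like `parsing_libraries=[library_with_version("unstructured")]`.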
| ) | ||
|
|
||
|
|
||
| def parse_office_doc( |
Can you make a unit test for this in test_paperqa.py? Feel free to use another LLM besides OpenAI (e.g. Anthropic, OpenRouter) for your testing
[success] 90.64% tests/test_paperqa.py::test_parse_office_doc[dummy.docx]: 1.5548s
[success] 6.33% tests/test_paperqa.py::test_parse_office_doc[dummy.xlsx]: 0.1086s
[success] 3.03% tests/test_paperqa.py::test_parse_office_doc[dummy.pptx]: 0.0520s
Results (5.19s):
3 passed
I'm not confident, but the test passed. I'll commit.
And sorry that the dummy .docx and .xlsx are written in Japanese.
Nice work, excited about this!
Sorry that I did not test chunking of .docx, .pptx, and .xlsx. If you give me some time, I'll add the new code to tests/test_paperqa.py, run the tests, and then commit it.
No worries, thanks again for the contribution so far, it's great
Include the version of the 'unstructured' library in the parsing_libraries metadata for office document parsing. This provides better traceability and debugging information for parsed documents.
Moves unstructured library imports into the parse_office_doc function to enable lazy loading. This improves application startup time and allows users to avoid installing unstructured dependencies unless they are processing office documents. An ImportError is now raised with instructions if the necessary dependencies are not found. Also, the unstructured version is now dynamically captured within the function for metadata.
- Add a unit test to verify parsing of .docx, .pptx, and .xlsx files.
- Add dummy office files for testing purposes.
- Update test configuration to use OpenRouter and a non-OpenAI embedding model to avoid authentication
issues during testing.
Can you fix the .mailmap issue in pre-commit? Then this PR should be ready
tests/test_paperqa.py
Outdated
    llm="openrouter/google/gemma-7b-it",
    llm_config={"api_key": os.environ.get("OPEN_ROUTER_API_KEY")},
    parsing=ParsingSettings(
        use_doc_details=False, disable_doc_valid_check=True, defer_embedding=True
-        use_doc_details=False, disable_doc_valid_check=True, defer_embedding=True
+        use_doc_details=False, disable_doc_valid_check=True
I don't think you need to defer embeddings; we can embed right away.
tests/test_paperqa.py
Outdated
docs = Docs()
settings = Settings(
    llm="openrouter/google/gemma-7b-it",
    llm_config={"api_key": os.environ.get("OPEN_ROUTER_API_KEY")},
-    llm_config={"api_key": os.environ.get("OPEN_ROUTER_API_KEY")},
I think you shouldn't need this, litellm should just auto check OPENROUTER_API_KEY: https://docs.litellm.ai/docs/providers/openrouter
So maybe update your local env to have OPENROUTER_API_KEY
Refactor test_parse_office_doc to utilize Gemini models for both LLM and embedding,
ensuring compatibility and leveraging Gemini's capabilities.
- Configured , , , and to use
gemini/gemini-2.5-flash and gemini/text-embedding-004 respectively.
- Removed explicit and as Gemini models
are expected to pick up API keys from environment variables.
- Added a call with a specific question to verify the RAG system's
functionality after document addition.
- Ensured is not set, allowing immediate embedding.
…shima/paper-qa into feature/new-doc-types
[success] 46.95% tests/test_paperqa.py::test_parse_office_doc[dummy.docx]: 8.4559s
Results (20.60s):
I'm still not confident, but the test passed.
tests/test_paperqa.py
Outdated
    embedding="gemini/text-embedding-004",
    # Explicitly specify the other LLM settings too
    summary_llm="gemini/gemini-2.5-flash",  # for summaries
    agent_llm="gemini/gemini-2.5-flash",  # for the agent
-    embedding="gemini/text-embedding-004",
-    # Explicitly specify the other LLM settings too
-    summary_llm="gemini/gemini-2.5-flash",  # for summaries
-    agent_llm="gemini/gemini-2.5-flash",  # for the agent
+    embedding="gemini/text-embedding-004",
+    summary_llm="gemini/gemini-2.5-flash",
+    agent={"agent_llm": "gemini/gemini-2.5-flash"},
No need for these comments, and also agent_llm is one level down
tests/test_paperqa.py
Outdated
    file_path,
    "dummy citation",
    docname=filename,
    dockey="dummy_doc",
-    dockey="dummy_doc",
No need to specify dockey if you're not going to use it
tests/test_paperqa.py
Outdated
    settings=settings,
)
assert docname is not None
assert len(docs.texts) > 0
Can you look at lint CI to fix this part
This commit fixes several linting errors and improves code clarity in
:
- Removed unnecessary comments and parameter in
.
- Corrected the configuration.
- Replaced with to address FURB115
warning.
[success] 47.26% tests/test_paperqa.py::test_parse_office_doc[dummy.docx]: 8.8184s
Results (22.94s):
Very close now, just need the unit test to be good
tests/test_paperqa.py
Outdated
session = await docs.aquery("What is the RAG system?", settings=settings)
assert session.answer
Okay I actually ran these tests just now a bit, and I noticed the question "What is the RAG system?" only applies to the dummy.docx.
Can you either:
- Adjust the question to match each document
- Change the .pptx and .xlsx to also have content for "What is the RAG system?"
Let's make the assertions:
session = await docs.aquery("What is the RAG system?", settings=settings)
assert session.used_contexts
assert len(session.answer) > 10, "Expected an answer"
assert CANNOT_ANSWER_PHRASE not in session.answer, (
    "Expected the system to be sure"
)
tests/test_paperqa.py
Outdated
    embedding="gemini/text-embedding-004",
    summary_llm="gemini/gemini-2.5-flash",
    agent={"agent_llm": "gemini/gemini-2.5-flash"},
    parsing=ParsingSettings(use_doc_details=False, disable_doc_valid_check=True),
-    parsing=ParsingSettings(use_doc_details=False, disable_doc_valid_check=True),
+    parsing=ParsingSettings(use_doc_details=False),
These docs should be valid (we don't need disable_doc_valid_check=True)
It seems the entropy calculation in the maybe_is_text() function didn't work properly on Japanese text, so a multilingual fix might be needed. This time, converting the .docx to English worked as a workaround and helped identify the cause. I can make adjustments if needed. What would you like to do?
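For context on why a character-entropy heuristic can misjudge Japanese text: per-character Shannon entropy is much higher for scripts with large character inventories, so a threshold tuned on ASCII English can reject valid Japanese prose. This sketch only illustrates the metric; maybe_is_text's actual internals are assumed, not quoted:

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Per-character Shannon entropy in bits.

    English prose reuses a small alphabet heavily (low entropy); Japanese
    prose draws from thousands of distinct characters (high entropy), so a
    fixed "looks like text" entropy ceiling can fail on it.
    """
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```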
I think this PR is ready to go; I was going to merge in the morning after a re-read. Let's do multilingual.
LGTM, thank you @takeruhukushima much appreciated
@jamesbraza
Oh, we'll cut a release in a week or so. In the meantime, to install the latest, just run: pip install git+https://github.com/Future-House/paper-qa.git
As a non-paying user, I cannot use the OpenAI API and have been unable to conduct testing. I apologize for this.
If the described function should be in the packages directory, please let me know and I'll recreate the branch.
Note
Adds .docx/.xlsx/.pptx parsing (text, tables, images) using Unstructured and wires it into reading/enrichment with a new office optional dependency.
- Adds parse_office_doc using unstructured.partition.auto.partition, extracting text, tables (HTML), and images to ParsedText/ParsedMedia.
- Lazily imports Image, Table from unstructured and integrates them into the parsing loop.
- Extends ENRICHMENT_EXTENSIONS to include ".docx", ".xlsx", ".pptx".
- Updates read_doc to route *.docx/*.xlsx/*.pptx to parse_office_doc (threaded), preserving chunking/enrichment behavior.
- Adds an office extra with unstructured[docx,xlsx,pptx] and includes it in the dev extras.
- Pins unstructured and required transitive deps; adds platform markers for some packages.
Written by Cursor Bugbot for commit 95e6e25.
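The extension-based routing the summary describes can be sketched as a simple dispatch. This is an illustrative stand-in, not read_doc's actual code; pick_parser and the returned labels are hypothetical:

```python
OFFICE_EXTENSIONS = (".docx", ".xlsx", ".pptx")


def pick_parser(path: str) -> str:
    """Choose a parser family from the file extension, case-insensitively."""
    lowered = path.lower()
    if lowered.endswith(".pdf"):
        return "pdf"
    if lowered.endswith(OFFICE_EXTENSIONS):
        return "office"  # new in this PR: routed to parse_office_doc
    if lowered.endswith((".txt", ".html")):
        return "text"
    return "code"  # fall-through, chunked by chunk_code_text
```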