Feature/new doc types #1169
Conversation
Documentation Updates: 1 document was updated by changes in this PR: paper-qa
This PR is being reviewed by Cursor Bugbot
Bug: Office Docs Missing Chunking Logic, Failing Processing
Missing chunking logic for office documents (.docx, .xlsx, .pptx). The parse_office_doc function (lines 176-220) creates ParsedText with content as a dict (line 209), but there's no corresponding chunking case in read_doc. Office documents will fall through to the else clause (line 499) which uses chunk_code_text. However, chunk_code_text expects content to be str|list (lines 287-290) and will raise NotImplementedError when given a dict, causing office document processing to fail during chunking.
src/paperqa/readers.py#L454-L505
paper-qa/src/paperqa/readers.py
Lines 454 to 505 in 95e6e25
if chunk_chars == 0:
    chunked_text = [
        Text(text=parsed_text.reduce_content(), name=doc.docname, doc=doc)
    ]
    chunk_metadata = ChunkMetadata(
        size=0,
        overlap=0,
        name=(
            f"paper-qa={pqa_version}|algorithm=none"
            f"|reduction=cl100k_base{enrichment_summary}"
        ),
    )
elif str_path.endswith(".pdf"):
    chunked_text = chunk_pdf(
        parsed_text, doc, chunk_chars=chunk_chars, overlap=overlap
    )
    chunk_metadata = ChunkMetadata(
        size=chunk_chars,
        overlap=overlap,
        name=(
            f"paper-qa={pqa_version}|algorithm=overlap-pdf"
            f"|size={chunk_chars}|overlap={overlap}{enrichment_summary}"
        ),
    )
elif str_path.endswith(IMAGE_EXTENSIONS):
    chunked_text = chunk_pdf(
        parsed_text, doc, chunk_chars=chunk_chars, overlap=overlap
    )
    chunk_metadata = ChunkMetadata(
        size=0,
        overlap=0,
        name=f"paper-qa={pqa_version}|algorithm=none{enrichment_summary}",
    )
elif str_path.endswith((".txt", ".html")):
    chunked_text = chunk_text(
        parsed_text, doc, chunk_chars=chunk_chars, overlap=overlap
    )
    chunk_metadata = ChunkMetadata(
        size=chunk_chars,
        overlap=overlap,
        name=(
            f"paper-qa={pqa_version}|algorithm=overlap-text|reduction=cl100k_base"
            f"|size={chunk_chars}|overlap={overlap}{enrichment_summary}"
        ),
    )
else:
    chunked_text = chunk_code_text(
        parsed_text, doc, chunk_chars=chunk_chars, overlap=overlap
    )
    chunk_metadata = ChunkMetadata(
        size=chunk_chars,
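One way to close the gap Bugbot describes is to flatten the dict-shaped office content into a single string before handing it to a text chunker. This is a hedged sketch, not paper-qa's actual fix; the helper name flatten_office_content is hypothetical, and it only assumes the dict shape (section name to text) described in the bug report above:

```python
def flatten_office_content(content) -> str:
    """Join dict-valued parsed content into one chunkable string.

    Office parsers may return a mapping (e.g. sheet or slide name -> text),
    while text chunkers expect a plain string, so flatten deterministically
    by key before chunking instead of falling through to chunk_code_text.
    """
    if isinstance(content, str):
        return content
    if isinstance(content, dict):
        return "\n\n".join(f"{key}\n{value}" for key, value in sorted(content.items()))
    raise NotImplementedError(f"Unsupported content type: {type(content).__name__}")
```

With a helper like this, an office-document branch could reuse the existing text-chunking path rather than raising NotImplementedError.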
src/paperqa/readers.py
Outdated
    media_index += 1
elif isinstance(el, Table):
    # For tables, we could get the HTML representation for better structure
    current_text += el.metadata.text_as_html + "\n\n"
Bug: Null Text Concatenation Crash in HTML Rendering
Potential TypeError when el.metadata.text_as_html is None. The code concatenates el.metadata.text_as_html with a string without checking if it's None first. If the unstructured library returns None for text_as_html on certain Table elements, this will raise a TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'.
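A defensive fix for this could fall back to an empty string. This is a hedged sketch with a hypothetical helper name, not the code actually merged in this PR:

```python
def append_table_html(current_text: str, text_as_html) -> str:
    """Append a table's HTML representation, tolerating a missing value.

    Some Table elements may have text_as_html unset (None); coalescing to an
    empty string avoids `TypeError: unsupported operand type(s) for +`.
    """
    return current_text + (text_as_html or "") + "\n\n"
```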
src/paperqa/readers.py
Outdated
from unstructured.documents.elements import Image, Table
from unstructured.partition.auto import partition
Take a look around the repo for other examples of lazy imports; let's just lazily import these in the first line of parse_office_doc:
try:
from unstructured.documents.elements import Image, Table
from unstructured.partition.auto import partition
except ImportError as exc:
raise ImportError(
"TODO some message mentioning to install `paper-qa[office]`"
) from exc
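The lazy-import pattern the reviewer is asking for can be generalized into a small helper. A hedged sketch under that assumption (require_extra is a hypothetical name, not paper-qa's API; the extra name `office` comes from this PR):

```python
import importlib


def require_extra(module_name: str, extra: str):
    """Lazily import module_name, pointing users at a pip extra on failure.

    Deferring the import keeps `import paperqa` fast and lets the heavy
    dependency stay optional until office parsing is actually requested.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"Parsing office documents requires {module_name!r}."
            f" Install it via `pip install paper-qa[{extra}]`."
        ) from exc
```

Usage inside parse_office_doc would then be a one-liner per dependency, e.g. `partition_mod = require_extra("unstructured.partition.auto", "office")`.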
src/paperqa/readers.py
Outdated
return ParsedText(
    content=content_dict,
    metadata=ParsedMetadata(
        parsing_libraries=["unstructured"],
Can you include the version of unstructured here? Take a look at how the Docling reader does it (at packages/paper-qa-docling)
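Capturing an installed package's version for metadata like this is typically done via importlib.metadata. A hedged sketch (helper name hypothetical; the Docling reader referenced above may do it differently):

```python
from importlib.metadata import PackageNotFoundError, version


def library_with_version(name: str) -> str:
    """Return 'name (x.y.z)' for an installed distribution, else just name.

    Useful for parsing_libraries metadata entries so parsed documents record
    which parser version produced them.
    """
    try:
        return f"{name} ({version(name)})"
    except PackageNotFoundError:
        return name
```

The metadata line would then read something like `parsing_libraries=[library_with_version("unstructured")]`.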
| ) | ||
|
|
||
|
|
||
| def parse_office_doc( |
Can you make a unit test for this in test_paperqa.py? Feel free to use another LLM besides OpenAI (e.g. Anthropic, OpenRouter) for your testing
[success] 90.64% tests/test_paperqa.py::test_parse_office_doc[dummy.docx]: 1.5548s
[success] 6.33% tests/test_paperqa.py::test_parse_office_doc[dummy.xlsx]: 0.1086s
[success] 3.03% tests/test_paperqa.py::test_parse_office_doc[dummy.pptx]: 0.0520s
Results (5.19s):
3 passed
I'm not confident, but the test passed. I'll commit.
And sorry that the dummy .docx and .xlsx are written in Japanese.
Nice work, excited about this!
Sorry that I did not test chunking of .docx, .pptx, and .xlsx. If you give me some time, I'll add the new code to tests/test_paperqa.py, run the tests, and then commit it.
No worries, thanks again for the contribution so far, it's great
Include the version of the 'unstructured' library in the parsing_libraries metadata for office document parsing. This provides better traceability and debugging information for parsed documents.
Moves unstructured library imports into the parse_office_doc function to enable lazy loading. This improves application startup time and allows users to avoid installing unstructured dependencies unless they are processing office documents. An ImportError is now raised with instructions if the necessary dependencies are not found. Also, the unstructured version is now dynamically captured within the function for metadata.
- Add a unit test to verify parsing of .docx, .pptx, and .xlsx files.
- Add dummy office files for testing purposes.
- Update test configuration to use OpenRouter and a non-OpenAI embedding model to avoid authentication
issues during testing.
Can you fix the .mailmap issue in pre-commit? Then this PR should be ready
tests/test_paperqa.py
Outdated
    llm="openrouter/google/gemma-7b-it",
    llm_config={"api_key": os.environ.get("OPEN_ROUTER_API_KEY")},
    parsing=ParsingSettings(
        use_doc_details=False, disable_doc_valid_check=True, defer_embedding=True
-        use_doc_details=False, disable_doc_valid_check=True, defer_embedding=True
+        use_doc_details=False, disable_doc_valid_check=True
I don't think you need to defer embeddings; we can embed right away.
tests/test_paperqa.py
Outdated
docs = Docs()
settings = Settings(
    llm="openrouter/google/gemma-7b-it",
    llm_config={"api_key": os.environ.get("OPEN_ROUTER_API_KEY")},
-    llm_config={"api_key": os.environ.get("OPEN_ROUTER_API_KEY")},
I think you shouldn't need this, litellm should just auto check OPENROUTER_API_KEY: https://docs.litellm.ai/docs/providers/openrouter
So maybe update your local env to have OPENROUTER_API_KEY
Refactor test_parse_office_doc to utilize Gemini models for both LLM and embedding,
ensuring compatibility and leveraging Gemini's capabilities.
- Configured , , , and to use
gemini/gemini-2.5-flash and gemini/text-embedding-004 respectively.
- Removed explicit and as Gemini models
are expected to pick up API keys from environment variables.
- Added a call with a specific question to verify the RAG system's
functionality after document addition.
- Ensured is not set, allowing immediate embedding.
…shima/paper-qa into feature/new-doc-types
[success] 46.95% tests/test_paperqa.py::test_parse_office_doc[dummy.docx]: 8.4559s
Results (20.60s):
I'm still not confident, but the test passed.
tests/test_paperqa.py
Outdated
    embedding="gemini/text-embedding-004",
    # Explicitly specify the other LLM settings too
    summary_llm="gemini/gemini-2.5-flash",  # for summaries
    agent_llm="gemini/gemini-2.5-flash",  # for the agent
-    embedding="gemini/text-embedding-004",
-    # Explicitly specify the other LLM settings too
-    summary_llm="gemini/gemini-2.5-flash",  # for summaries
-    agent_llm="gemini/gemini-2.5-flash",  # for the agent
+    embedding="gemini/text-embedding-004",
+    summary_llm="gemini/gemini-2.5-flash",
+    agent={"agent_llm": "gemini/gemini-2.5-flash"},
No need for these comments, and also agent_llm is one level down
tests/test_paperqa.py
Outdated
    file_path,
    "dummy citation",
    docname=filename,
    dockey="dummy_doc",
-    dockey="dummy_doc",
No need to specify dockey if you're not going to use it
tests/test_paperqa.py
Outdated
    settings=settings,
)
assert docname is not None
assert len(docs.texts) > 0
Can you look at lint CI to fix this part
This commit fixes several linting errors and improves code clarity in
:
- Removed unnecessary comments and parameter in
.
- Corrected the configuration.
- Replaced with to address FURB115
warning.
[success] 47.26% tests/test_paperqa.py::test_parse_office_doc[dummy.docx]: 8.8184s
Results (22.94s):
Very close now, just need the unit test to be good
tests/test_paperqa.py
Outdated
session = await docs.aquery("What is the RAG system?", settings=settings)
assert session.answer
Okay I actually ran these tests just now a bit, and I noticed the question "What is the RAG system?" only applies to the dummy.docx.
Can you either:
- Adjust the question to match each document
- Change the .pptx and .xlsx to also have content for "What is the RAG system?"
Let's make the assertions:
session = await docs.aquery("What is the RAG system?", settings=settings)
assert session.used_contexts
assert len(session.answer) > 10, "Expected an answer"
assert CANNOT_ANSWER_PHRASE not in session.answer, (
    "Expected the system to be sure"
)
tests/test_paperqa.py
Outdated
    embedding="gemini/text-embedding-004",
    summary_llm="gemini/gemini-2.5-flash",
    agent={"agent_llm": "gemini/gemini-2.5-flash"},
    parsing=ParsingSettings(use_doc_details=False, disable_doc_valid_check=True),
-    parsing=ParsingSettings(use_doc_details=False, disable_doc_valid_check=True),
+    parsing=ParsingSettings(use_doc_details=False),
These docs should be valid (we don't need disable_doc_valid_check=True)
It seems the entropy calculation in the maybe_is_text() function didn't work properly on Japanese text, so a multilingual fix might be needed. This time, converting the .docx to English worked as a workaround and helped identify the cause. I can make adjustments if needed. What would you like to do?
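For context on why a character-entropy heuristic can misjudge Japanese text: per-character Shannon entropy is much higher for scripts with large character inventories, so a threshold tuned on ASCII English can reject valid Japanese prose. This sketch only illustrates the metric; maybe_is_text's actual internals are assumed, not quoted:

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Per-character Shannon entropy in bits.

    English prose reuses a small alphabet heavily (low entropy); Japanese
    prose draws from thousands of distinct characters (high entropy), so a
    fixed "looks like text" entropy ceiling can fail on it.
    """
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```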
I think this PR is ready to go; I was going to merge in the morning after a re-read. Let's do multilingual.
LGTM, thank you @takeruhukushima much appreciated
@jamesbraza
Oh, we'll cut a release in a week or so. In the meantime, to install the latest, just run: pip install git+https://github.com/Future-House/paper-qa.git
As a non-paying user, I cannot use the OpenAI API and have been unable to conduct testing. I apologize for this.
If the described function should be in the packages directory, please let me know and I'll recreate the branch.
Note
Adds .docx/.xlsx/.pptx parsing (text, tables, images) using Unstructured and wires it into reading/enrichment with a new office optional dependency.
- Adds parse_office_doc using unstructured.partition.auto.partition, extracting text, tables (HTML), and images to ParsedText/ParsedMedia.
- Lazily imports Image, Table from unstructured and integrates them into the parsing loop.
- Extends ENRICHMENT_EXTENSIONS to include ".docx", ".xlsx", ".pptx".
- Updates read_doc to route *.docx/*.xlsx/*.pptx to parse_office_doc (threaded), preserving chunking/enrichment behavior.
- Adds an office extra with unstructured[docx,xlsx,pptx] and includes it in the dev extras.
- Pins unstructured and required transitive deps; adds platform markers for some packages.
Written by Cursor Bugbot for commit 95e6e25.
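The extension-based routing the summary describes can be sketched as a simple dispatch. This is an illustrative stand-in, not read_doc's actual code; pick_parser and the returned labels are hypothetical:

```python
OFFICE_EXTENSIONS = (".docx", ".xlsx", ".pptx")


def pick_parser(path: str) -> str:
    """Choose a parser family from the file extension, case-insensitively."""
    lowered = path.lower()
    if lowered.endswith(".pdf"):
        return "pdf"
    if lowered.endswith(OFFICE_EXTENSIONS):
        return "office"  # new in this PR: routed to parse_office_doc
    if lowered.endswith((".txt", ".html")):
        return "text"
    return "code"  # fall-through, chunked by chunk_code_text
```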