
Conversation

Contributor

@takeruhukushima takeruhukushima commented Nov 1, 2025

As a non-paying user, I cannot use the OpenAI API and have been unable to conduct testing. I apologize for this.
If the described function should be in the packages directory, please let me know and I'll recreate the branch.


Note

Adds .docx/.xlsx/.pptx parsing (text, tables, images) using Unstructured and wires it into reading/enrichment with a new office optional dependency.

  • Readers:
    • Add parse_office_doc using unstructured.partition.auto.partition, extracting text, tables (HTML), and images to ParsedText/ParsedMedia.
    • Import Image, Table from unstructured and integrate into parsing loop.
    • Extend ENRICHMENT_EXTENSIONS to include ".docx", ".xlsx", ".pptx".
    • Update read_doc to route *.docx/*.xlsx/*.pptx to parse_office_doc (threaded), preserving chunking/enrichment behavior.
  • Packaging/Deps:
    • Add new optional extra office with unstructured[docx,xlsx,pptx] and include it in dev extras.
    • Lockfile updates adding unstructured and required transitive deps; add platform markers for some packages.

Written by Cursor Bugbot for commit 95e6e25.

@dosubot dosubot bot added the size:M This PR changes 30-99 lines, ignoring generated files. label Nov 1, 2025

dosubot bot commented Nov 1, 2025

Documentation Updates

1 document(s) were updated by changes in this PR:

paper-qa


@dosubot dosubot bot added the needs information Subject matter is currently underspecified label Nov 1, 2025

@cursor cursor bot left a comment


This PR is being reviewed by Cursor Bugbot


Bug: Office Docs Missing Chunking Logic, Failing Processing

Missing chunking logic for office documents (.docx, .xlsx, .pptx). The parse_office_doc function (lines 176-220) creates ParsedText with content as a dict (line 209), but there's no corresponding chunking case in read_doc. Office documents will fall through to the else clause (line 499) which uses chunk_code_text. However, chunk_code_text expects content to be str|list (lines 287-290) and will raise NotImplementedError when given a dict, causing office document processing to fail during chunking.

src/paperqa/readers.py#L454-L505

if chunk_chars == 0:
    chunked_text = [
        Text(text=parsed_text.reduce_content(), name=doc.docname, doc=doc)
    ]
    chunk_metadata = ChunkMetadata(
        size=0,
        overlap=0,
        name=(
            f"paper-qa={pqa_version}|algorithm=none"
            f"|reduction=cl100k_base{enrichment_summary}"
        ),
    )
elif str_path.endswith(".pdf"):
    chunked_text = chunk_pdf(
        parsed_text, doc, chunk_chars=chunk_chars, overlap=overlap
    )
    chunk_metadata = ChunkMetadata(
        size=chunk_chars,
        overlap=overlap,
        name=(
            f"paper-qa={pqa_version}|algorithm=overlap-pdf"
            f"|size={chunk_chars}|overlap={overlap}{enrichment_summary}"
        ),
    )
elif str_path.endswith(IMAGE_EXTENSIONS):
    chunked_text = chunk_pdf(
        parsed_text, doc, chunk_chars=chunk_chars, overlap=overlap
    )
    chunk_metadata = ChunkMetadata(
        size=0,
        overlap=0,
        name=f"paper-qa={pqa_version}|algorithm=none{enrichment_summary}",
    )
elif str_path.endswith((".txt", ".html")):
    chunked_text = chunk_text(
        parsed_text, doc, chunk_chars=chunk_chars, overlap=overlap
    )
    chunk_metadata = ChunkMetadata(
        size=chunk_chars,
        overlap=overlap,
        name=(
            f"paper-qa={pqa_version}|algorithm=overlap-text|reduction=cl100k_base"
            f"|size={chunk_chars}|overlap={overlap}{enrichment_summary}"
        ),
    )
else:
    chunked_text = chunk_code_text(
        parsed_text, doc, chunk_chars=chunk_chars, overlap=overlap
    )
    chunk_metadata = ChunkMetadata(
        size=chunk_chars,
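One possible direction for the fix, as a sketch only (helper names and the dict shape are assumptions, not the merged implementation): reduce the dict-valued office content to a single string before applying the usual character-overlap chunking, so office documents never fall through to `chunk_code_text`.

```python
def reduce_office_content(content: dict[str, str]) -> str:
    """Flatten dict-valued parsed content (assumed shape: section name -> text)."""
    return "\n\n".join(content.values())


def chunk_overlap(text: str, chunk_chars: int, overlap: int) -> list[str]:
    """Plain character-overlap chunking, mirroring the .txt/.html branch."""
    assert chunk_chars > overlap, "step must be positive"
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start : start + chunk_chars])
        start += chunk_chars - overlap
    return chunks
```

With a dedicated `elif` branch for `.docx`/`.xlsx`/`.pptx` built on helpers like these, the dict content would be flattened first and the `NotImplementedError` from `chunk_code_text` avoided.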



    media_index += 1
elif isinstance(el, Table):
    # For tables, we could get the HTML representation for better structure
    current_text += el.metadata.text_as_html + "\n\n"

Bug: Null Text Concatenation Crash in HTML Rendering

Potential TypeError when el.metadata.text_as_html is None. The code concatenates el.metadata.text_as_html with a string without checking if it's None first. If the unstructured library returns None for text_as_html on certain Table elements, this will raise a TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'.


Comment on lines 14 to 15
from unstructured.documents.elements import Image, Table
from unstructured.partition.auto import partition
Collaborator

@jamesbraza jamesbraza Nov 2, 2025


Take a look around the repo for other examples of lazy imports, let's just lazily import these in the first line of parse_office_doc:

try:
    from unstructured.documents.elements import Image, Table
    from unstructured.partition.auto import partition
except ImportError as exc:
    raise ImportError(
        "TODO some message mentioning to install `paper-qa[office]`"
    ) from exc

return ParsedText(
    content=content_dict,
    metadata=ParsedMetadata(
        parsing_libraries=["unstructured"],
Collaborator


Can you include the version of unstructured here? Take a look at how the Docling reader does it (at packages/paper-qa-docling)
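One way to capture the installed version is via `importlib.metadata` (a sketch; the exact string format the Docling reader uses at packages/paper-qa-docling may differ, and `parsing_library_entry` is a hypothetical helper name):

```python
from importlib.metadata import PackageNotFoundError, version


def parsing_library_entry(package: str) -> str:
    """Return 'name (x.y.z)' for a parsing_libraries metadata entry."""
    try:
        pkg_version = version(package)
    except PackageNotFoundError:
        pkg_version = "unknown"
    return f"{package} ({pkg_version})"
```

Usage would then be something like `parsing_libraries=[parsing_library_entry("unstructured")]`.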

)


def parse_office_doc(
Collaborator


Can you make a unit test for this in test_paperqa.py? Feel free to use another LLM besides OpenAI (e.g. Anthropic, OpenRouter) for your testing

Contributor Author


[success] 90.64% tests/test_paperqa.py::test_parse_office_doc[dummy.docx]: 1.5548s
[success] 6.33% tests/test_paperqa.py::test_parse_office_doc[dummy.xlsx]: 0.1086s
[success] 3.03% tests/test_paperqa.py::test_parse_office_doc[dummy.pptx]: 0.0520s

Results (5.19s):
3 passed

I'm not confident, but the test passed. I'll commit.

Contributor Author


And sorry that the dummy .docx and .xlsx are written in Japanese.

Collaborator

@jamesbraza jamesbraza left a comment


Nice work, excited about this!

@takeruhukushima
Contributor Author

Sorry that I did not test chunking for .docx, .pptx, and .xlsx.

If you give me some time, I'll add the new code to tests/test_paperqa.py, run the tests, and then commit it.

Collaborator

@jamesbraza jamesbraza left a comment


No worries, thanks again for the contribution so far, it's great

  Include the version of the 'unstructured' library in the parsing_libraries metadata for office document
  parsing. This provides better traceability and debugging information for parsed documents.
  Moves unstructured library imports into the parse_office_doc function to enable lazy loading. This improves
  application startup time and allows users to avoid installing unstructured dependencies unless they are
  processing office documents. An ImportError is now raised with instructions if the necessary dependencies
  are not found. Also, the unstructured version is now dynamically captured within the function for metadata.
    - Add a unit test for parse_office_doc to verify parsing of .docx, .pptx, and .xlsx files.
    - Add dummy office files to  for testing purposes.
    - Update test configuration to use OpenRouter and a non-OpenAI embedding model to avoid authentication
     issues during testing.
@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:M This PR changes 30-99 lines, ignoring generated files. labels Nov 2, 2025
Collaborator

@jamesbraza jamesbraza left a comment


Can you fix the .mailmap issue in pre-commit? Then this PR should be ready

llm="openrouter/google/gemma-7b-it",
llm_config={"api_key": os.environ.get("OPEN_ROUTER_API_KEY")},
parsing=ParsingSettings(
    use_doc_details=False, disable_doc_valid_check=True, defer_embedding=True
Collaborator


Suggested change
use_doc_details=False, disable_doc_valid_check=True, defer_embedding=True
use_doc_details=False, disable_doc_valid_check=True

I think you don't need to defer embeddings, we can embed right away

docs = Docs()
settings = Settings(
    llm="openrouter/google/gemma-7b-it",
    llm_config={"api_key": os.environ.get("OPEN_ROUTER_API_KEY")},
Collaborator


Suggested change
llm_config={"api_key": os.environ.get("OPEN_ROUTER_API_KEY")},

I think you shouldn't need this, litellm should just auto check OPENROUTER_API_KEY: https://docs.litellm.ai/docs/providers/openrouter

So maybe update your local env to have OPENROUTER_API_KEY

    Refactor test_parse_office_doc to utilize Gemini models for both LLM and embedding,
    ensuring compatibility and leveraging Gemini's capabilities.

    - Configured llm, summary_llm, agent_llm, and embedding to use
      gemini/gemini-2.5-flash and gemini/text-embedding-004 respectively.
    - Removed the explicit llm_config and api_key entries, as Gemini models
      are expected to pick up API keys from environment variables.
    - Added an aquery call with a specific question to verify the RAG system's
      functionality after document addition.
    - Ensured defer_embedding is not set, allowing immediate embedding.
@takeruhukushima
Contributor Author

[success] 46.95% tests/test_paperqa.py::test_parse_office_doc[dummy.docx]: 8.4559s
[success] 32.33% tests/test_paperqa.py::test_parse_office_doc[dummy.pptx]: 5.8237s
[success] 20.72% tests/test_paperqa.py::test_parse_office_doc[dummy.xlsx]: 3.7317s

Results (20.60s):
3 passed

I'm still not confident, but the tests passed.

Comment on lines 3124 to 3127
embedding="gemini/text-embedding-004",
# Explicitly specify the other LLM settings too
summary_llm="gemini/gemini-2.5-flash",  # for summaries
agent_llm="gemini/gemini-2.5-flash",  # for the agent
Collaborator


Suggested change
embedding="gemini/text-embedding-004",
# Explicitly specify the other LLM settings too
summary_llm="gemini/gemini-2.5-flash",  # for summaries
agent_llm="gemini/gemini-2.5-flash",  # for the agent
embedding="gemini/text-embedding-004",
summary_llm="gemini/gemini-2.5-flash",
agent={"agent_llm": "gemini/gemini-2.5-flash"},

No need for these comments, and also agent_llm is one level down

file_path,
"dummy citation",
docname=filename,
dockey="dummy_doc",
Collaborator


Suggested change
dockey="dummy_doc",

No need to specify dockey if you're not going to use it

settings=settings,
)
assert docname is not None
assert len(docs.texts) > 0
Collaborator


Can you look at lint CI to fix this part

    This commit fixes several linting errors and improves code clarity in
    tests/test_paperqa.py:

    - Removed unnecessary comments and the dockey parameter in
      test_parse_office_doc.
    - Corrected the agent configuration.
    - Replaced len(docs.texts) > 0 with docs.texts to address the FURB115
      warning.
@takeruhukushima
Contributor Author

[success] 47.26% tests/test_paperqa.py::test_parse_office_doc[dummy.docx]: 8.8184s
[success] 30.70% tests/test_paperqa.py::test_parse_office_doc[dummy.pptx]: 5.7275s
[success] 22.04% tests/test_paperqa.py::test_parse_office_doc[dummy.xlsx]: 4.1121s

Results (22.94s):
3 passed

Collaborator

@jamesbraza jamesbraza left a comment


Very close now, just need the unit test to be good

Comment on lines 3137 to 3138
session = await docs.aquery("What is the RAG system?", settings=settings)
assert session.answer
Collaborator


Okay I actually ran these tests just now a bit, and I noticed the question "What is the RAG system?" only applies to the dummy.docx.

Can you either:

  • Adjust the question to match each document
  • Change the .pptx and xlsx to also have content for "What is the RAG system?"

Let's make the assertions:

session = await docs.aquery("What is the RAG system?", settings=settings)
assert session.used_contexts
assert len(session.answer) > 10, "Expected an answer"
assert CANNOT_ANSWER_PHRASE not in session.answer, (
    "Expected the system to be sure"
)

embedding="gemini/text-embedding-004",
summary_llm="gemini/gemini-2.5-flash",
agent={"agent_llm": "gemini/gemini-2.5-flash"},
parsing=ParsingSettings(use_doc_details=False, disable_doc_valid_check=True),
Collaborator


Suggested change
parsing=ParsingSettings(use_doc_details=False, disable_doc_valid_check=True),
parsing=ParsingSettings(use_doc_details=False),

These docs should be valid (we don't need disable_doc_valid_check=True)

@takeruhukushima
Contributor Author

It seems the entropy calculation for the maybe_is_text() function didn't work properly on Japanese text, so a multilingual fix might be needed. This time, converting the .docx to English worked as a workaround and helped identify the cause.

I can make adjustments if needed. What would you like to do?
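For context on the failure mode: a heuristic like maybe_is_text presumably scores how "text-like" a character stream is. A simple ASCII-ratio sketch (an illustration only, not paper-qa's actual entropy check) shows why Japanese prose can fail such a test:

```python
def ascii_text_fraction(text: str) -> float:
    """Fraction of characters that are printable ASCII (a stand-in for the
    kind of heuristic maybe_is_text uses; the real check differs)."""
    if not text:
        return 0.0
    return sum(c.isascii() and c.isprintable() for c in text) / len(text)


# English prose scores near 1.0, while Japanese prose scores near 0.0,
# so any ASCII-biased threshold will reject valid Japanese documents.
```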

@jamesbraza
Collaborator

I can make adjustments if needed. What would you like to do?

I think this PR is ready to go, I was going to merge in the morning after a re-read. Let's do multilingual maybe_is_text in a separate PR

Collaborator

@jamesbraza jamesbraza left a comment


LGTM, thank you @takeruhukushima much appreciated

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Nov 4, 2025
@jamesbraza jamesbraza merged commit fc2c1b8 into Future-House:main Nov 4, 2025
3 of 7 checks passed
@takeruhukushima
Contributor Author

@jamesbraza
Thank you for spending two days with me.
I'd like to try it out in my work right away, but how many more days will it take before it's reflected as a Python library?

@jamesbraza
Collaborator

Oh, we'll cut a release in a week or so. In the meantime, just do this to install the latest main branch:

pip install git+https://github.com/Future-House/paper-qa.git

