Improve error messages for missing blocks when parsing incomplete JSON #150

Open
kkhator-aws opened this issue Jun 12, 2023 · 1 comment
Labels
enhancement New feature or request python Relates to the Python version of TRP

Comments

kkhator-aws commented Jun 12, 2023

Hi,
My customer is receiving the error below when using Textractor with a large multi-page PDF file.

899858907a773d1d5932a263c039a8fced6b281b0e716fbd31366bff7c4392c
Traceback (most recent call last):
  File "C:\Users\YADAVA66\PycharmProjects\pythonProject\main.py", line 80, in <module>
    doc = Document(response)
  File "C:\Users\YADAVA66\PycharmProjects\Textract\lib\site-packages\trp\__init__.py", line 633, in __init__
    self._parse()
  File "C:\Users\YADAVA66\PycharmProjects\Textract\lib\site-packages\trp\__init__.py", line 667, in _parse
    page = Page(documentPage["Blocks"], self._blockMap)
  File "C:\Users\YADAVA66\PycharmProjects\Textract\lib\site-packages\trp\__init__.py", line 516, in __init__
    self._parse(blockMap)
  File "C:\Users\YADAVA66\PycharmProjects\Textract\lib\site-packages\trp\__init__.py", line 530, in _parse
    l = Line(item, blockMap)
  File "C:\Users\YADAVA66\PycharmProjects\Textract\lib\site-packages\trp\__init__.py", line 142, in __init__
    if(blockMap[cid]["BlockType"] == "WORD"):
KeyError: '5e06e009-03ac-42cc-9abf-4df8f606c2af'
@schadem schadem transferred this issue from aws-samples/amazon-textract-textractor Jun 12, 2023
@schadem schadem added the enhancement New feature or request label Jun 12, 2023
Contributor

schadem commented Jun 12, 2023

This is not a bug. The JSON passed to trp is incomplete, so it is missing a block ID that another block references. This usually happens when an asynchronous API (Start*) is called, the result is paginated, and only the first JSON response page is used.
Use get_full_json_from_output_config or get_full_json from https://pypi.org/project/amazon-textract-caller/ to get the full JSON object, and pass that to the Textract response parser.
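For anyone who wants to see what "getting the full JSON" involves, here is a minimal sketch of following the pagination by hand. It is not the amazon-textract-caller implementation: `collect_full_response` and the `get_page` parameter are hypothetical names, and the code only assumes that each page is a dict with a `Blocks` list and an optional `NextToken`, which is how the Textract Get* responses are shaped.

```python
from typing import Any, Callable, Dict, List, Optional


def collect_full_response(
    get_page: Callable[..., Dict[str, Any]], job_id: str
) -> Dict[str, Any]:
    """Follow NextToken until it runs out and merge the Blocks of every page.

    `get_page` is a hypothetical injection point: any callable that accepts
    JobId and (optionally) NextToken keyword arguments and returns one page.
    """
    blocks: List[Dict[str, Any]] = []
    first_page: Optional[Dict[str, Any]] = None
    next_token: Optional[str] = None

    while True:
        kwargs: Dict[str, Any] = {"JobId": job_id}
        if next_token:
            kwargs["NextToken"] = next_token
        page = get_page(**kwargs)

        if first_page is None:
            first_page = page  # keep metadata such as JobStatus from page one
        blocks.extend(page.get("Blocks", []))

        next_token = page.get("NextToken")
        if not next_token:
            break

    # Same shape as a single response, but with all blocks and no NextToken.
    merged = {k: v for k, v in first_page.items() if k != "NextToken"}
    merged["Blocks"] = blocks
    return merged
```

With boto3, `get_page` could be something like `boto3.client("textract").get_document_analysis` (or `get_document_text_detection`), called once the job reports `SUCCEEDED`. The merged dict is what should go into `Document(...)`, rather than the first page alone.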
I am keeping this issue open as a reminder to update the error message so that it points to this explanation and recommends getting the full JSON.
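As a rough idea of what a friendlier check could look like, the sketch below scans a response for relationship IDs that have no matching block and raises a descriptive error instead of a bare `KeyError`. The function names `find_missing_block_ids` and `check_response_complete` and the wording of the message are assumptions for illustration only; none of this exists in trp today.

```python
from typing import Any, Dict, List


def find_missing_block_ids(response: Dict[str, Any]) -> List[str]:
    """Return IDs referenced in Relationships that are absent from Blocks."""
    blocks = response.get("Blocks", [])
    known = {block["Id"] for block in blocks}
    missing: List[str] = []
    seen = set()
    for block in blocks:
        # Relationships may be absent or None on blocks without children.
        for relationship in block.get("Relationships") or []:
            for ref_id in relationship.get("Ids", []):
                if ref_id not in known and ref_id not in seen:
                    seen.add(ref_id)
                    missing.append(ref_id)
    return missing


def check_response_complete(response: Dict[str, Any]) -> None:
    """Raise a descriptive error if the response references unknown blocks."""
    missing = find_missing_block_ids(response)
    if missing:
        raise ValueError(
            f"{len(missing)} referenced block ID(s) are missing from the JSON "
            f"(first: {missing[0]}). The response is probably incomplete, for "
            "example only the first page of a paginated asynchronous result. "
            "Get the full JSON before parsing."
        )
```

Calling `check_response_complete(response)` before `Document(response)` would surface the incomplete-JSON case with an actionable message.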

@athewsey athewsey changed the title Error parsing multiple page pdf Improve error messages for missing blocks when parsing incomplete JSON Jun 7, 2024
@athewsey athewsey added the python Relates to the Python version of TRP label Jun 7, 2024