Improve error messages for missing blocks when parsing incomplete JSON #150

kkhator-aws · 2023-06-12T18:13:21Z

Hi,
My customer is receiving the error below when using Textractor with a large multi-page PDF file.

899858907a773d1d5932a263c039a8fced6b281b0e716fbd31366bff7c4392c
Traceback (most recent call last):
  File "C:\Users\YADAVA66\PycharmProjects\pythonProject\main.py", line 80, in <module>
    doc = Document(response)
  File "C:\Users\YADAVA66\PycharmProjects\Textract\lib\site-packages\trp\__init__.py", line 633, in __init__
    self._parse()
  File "C:\Users\YADAVA66\PycharmProjects\Textract\lib\site-packages\trp\__init__.py", line 667, in _parse
    page = Page(documentPage["Blocks"], self._blockMap)
  File "C:\Users\YADAVA66\PycharmProjects\Textract\lib\site-packages\trp\__init__.py", line 516, in __init__
    self._parse(blockMap)
  File "C:\Users\YADAVA66\PycharmProjects\Textract\lib\site-packages\trp\__init__.py", line 530, in _parse
    l = Line(item, blockMap)
  File "C:\Users\YADAVA66\PycharmProjects\Textract\lib\site-packages\trp\__init__.py", line 142, in __init__
    if(blockMap[cid]["BlockType"] == "WORD"):
KeyError: '5e06e009-03ac-42cc-9abf-4df8f606c2af'


schadem · 2023-06-12T19:11:33Z

This is not a bug. The JSON passed to trp is incomplete, so it is missing a block ID that another block references. This usually happens when an asynchronous API (Start*) is called, the result is paginated, and only the first JSON response page is used.
Use get_full_json_from_output_config or get_full_json from https://pypi.org/project/amazon-textract-caller/ to get the full JSON object, and pass that to the textract-response parser.
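For illustration, here is a minimal sketch of what "getting the full JSON" means when done by hand. It is not the amazon-textract-caller implementation. It assumes the job has already finished, and that `fetch_page` is a callable behaving like boto3's Textract `get_document_text_detection` or `get_document_analysis` (accepting `JobId` and an optional `NextToken`, and returning a dict with `Blocks` and possibly `NextToken`). The function name `collect_full_response` is hypothetical.

```python
def collect_full_response(fetch_page, job_id):
    """Merge every paginated result page of an async job into one dict.

    fetch_page: assumed to behave like a boto3 Textract Get* method,
                e.g. client.get_document_text_detection.
    job_id:     the JobId returned by the matching Start* call.
    """
    all_blocks = []
    metadata = None
    next_token = None

    while True:
        kwargs = {"JobId": job_id}
        if next_token:
            kwargs["NextToken"] = next_token
        page = fetch_page(**kwargs)

        # Keep the document metadata from the first page that has it.
        if metadata is None:
            metadata = page.get("DocumentMetadata")

        all_blocks.extend(page.get("Blocks", []))

        # No NextToken means this was the last page.
        next_token = page.get("NextToken")
        if not next_token:
            break

    return {"DocumentMetadata": metadata, "Blocks": all_blocks}
```

With a real client this might be called as `collect_full_response(boto3.client("textract").get_document_text_detection, job_id)`, and the returned dict passed to `Document(...)` in place of the first response page.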
Keeping this issue open to remind me to update the error message so that it points here and recommends getting the full JSON.
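Until the error message is improved, a pre-check along these lines could surface the problem before parsing. This is only a sketch, not part of trp: the helper names `find_missing_block_ids` and `check_response_complete` are hypothetical, and the only assumption about the input is the standard Textract response shape, where each block has an `Id` and an optional `Relationships` list whose entries carry `Ids`.

```python
def find_missing_block_ids(response):
    """Return the set of block IDs that are referenced but not present."""
    blocks = response.get("Blocks", [])
    known_ids = {block["Id"] for block in blocks}

    missing = set()
    for block in blocks:
        # Relationships may be absent (or None) on leaf blocks such as WORD.
        for relationship in block.get("Relationships") or []:
            for child_id in relationship.get("Ids", []):
                if child_id not in known_ids:
                    missing.add(child_id)
    return missing


def check_response_complete(response):
    """Raise a descriptive error instead of a bare KeyError later on."""
    missing = find_missing_block_ids(response)
    if missing:
        example = sorted(missing)[0]
        raise ValueError(
            f"{len(missing)} referenced block ID(s) are missing from the "
            f"response (for example {example}). The JSON is probably "
            "incomplete: for asynchronous (Start*) jobs, collect all "
            "paginated result pages before parsing."
        )
```

Calling `check_response_complete(response)` just before `Document(response)` would turn the `KeyError` from the traceback above into a message that names the likely cause.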

schadem transferred this issue from aws-samples/amazon-textract-textractor Jun 12, 2023

schadem added the enhancement New feature or request label Jun 12, 2023

athewsey changed the title ~~Error parsing multiple page pdf~~ Improve error messages for missing blocks when parsing incomplete JSON Jun 7, 2024

athewsey added the python Relates to the Python version of TRP label Jun 7, 2024
