Convert Document AI Object to Preserve Layout Text? #159

raad-altaie · 2023-08-30T23:14:32Z

Is your feature request related to a problem? Please describe.

I've been using Google Document AI for text extraction from scanned documents, and it's been working well in terms of extracting text. However, I'm facing an issue when it comes to preserving the layout of the text.

In AWS Textract, there's a tool called "pretty print" that helps maintain the layout of extracted text. Tesseract, on the other hand, can preserve interword spaces with the config='-c preserve_interword_spaces=1' option, which does roughly the same thing.
I really wish "python-documentai-toolbox" could support this kind of output.

Describe the solution you'd like

documentai object => preserved layout text

Describe alternatives you've considered

Extracting text using the pdftotext library seemed like a viable option, but surprisingly, "python-documentai-toolbox" doesn't offer support for PDF output, which is rather baffling.


holtskinner · 2023-09-08T19:29:45Z

Can you provide more information on what you mean by "preserving the layout of the text"?

Do you want all of the text to be printed to the screen or a TXT file in the same general locations as the source document?

An example of an input document and the output text would be useful.

This will likely be difficult to implement, since the layout information extracted from Document AI uses bounding boxes with X, Y coordinates (which don't map cleanly onto TXT files).

Document AI by design doesn't fill in the Document.text field with extra spaces/tabs to signify where the text sits on the page.

It could be possible to use the Document.Page.Block field to identify blocks of text and place them generally in the same order, but again this isn't going to be very exact since Coordinates don't have a 1-1 relationship in text files.
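A minimal sketch of that idea, for illustration only: snap each piece of text onto a fixed character grid using the top-left corner of its bounding box. The `Token` class and `render_layout` function are hypothetical names, not part of Document AI or the toolbox. In a real document the coordinates would presumably come from each token's `layout.bounding_poly.normalized_vertices`, and the result is only approximate, for the reason given above.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """Hypothetical stand-in for one OCR token (not a Document AI type)."""
    text: str
    x: float  # left edge, normalized to 0.0-1.0 of the page width
    y: float  # top edge, normalized to 0.0-1.0 of the page height


def render_layout(tokens, cols=80, rows=40):
    """Place each token on a cols x rows character grid and return the text."""
    lines = {}
    # Process in reading order: top to bottom, then left to right.
    for tok in sorted(tokens, key=lambda t: (t.y, t.x)):
        row = min(int(tok.y * rows), rows - 1)
        col = min(int(tok.x * cols), cols - 1)
        line = lines.setdefault(row, "")
        # Never overwrite earlier text: keep at least one space after it.
        if line and col <= len(line):
            col = len(line) + 1
        lines[row] = line.ljust(col) + tok.text
    if not lines:
        return ""
    # Emit empty strings for grid rows that received no text.
    return "\n".join(lines.get(r, "") for r in range(max(lines) + 1))
```

The grid size is the main design choice here: a finer grid keeps horizontal offsets more faithfully but produces longer lines, and tokens that map to the same cell are simply pushed to the right.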

raad-altaie · 2023-09-08T21:17:37Z

@holtskinner thank you for your response!
What I am looking for is something like the example below.

image:

and the output I am getting is as follows:

Someto the left
Someto the left

Some in the middle
Some in the middle

Some with some tab
Some with some tab

Some with some space between them
Some with some space between them

Sometext here
Sometext here

this much
this much

How do I get an output string with the same structure as the image?

i.e. as follows:

 										         Some text here
 										         Some text here

Some to the left
Some to the left

 					Some in the middle
 					Some in the middle

 		Some with some tab
 		Some with some tab

Some with some space between them						this much
Some with some space between them						this much

Also, do you have an example of how I can use Document.Page.Block to restructure the document? (I'll give it a try.)

think-diff · 2023-09-23T02:58:21Z

we want to do the same thing here!

ThreeHAN · 2023-12-05T16:56:57Z

At the very least, ensuring there are spaces between words in the text output from Document AI would be of great assistance. Sometimes, when words are in different entities but next to each other, the Document AI text blob shows them as "twowords" as opposed to "two words". Having a helper function that ensures the spaces are there would reduce custom post-processing for us.
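A rough sketch of what such a helper might look like, assuming each word is available as a (start, end) offset pair into the full text (similar in spirit to the start and end indexes of a text segment). The name `join_tokens_with_spaces` is hypothetical and this is not an existing toolbox function.

```python
def join_tokens_with_spaces(text, segments):
    """Rebuild text from (start, end) offsets, adding a space wherever two
    consecutive pieces touch with no whitespace between them."""
    out = ""
    prev_end = None
    for start, end in sorted(segments):
        if prev_end is not None:
            # Keep whatever separator the original text already had.
            out += text[prev_end:start]
        piece = text[start:end]
        if out and piece and not out[-1].isspace() and not piece[0].isspace():
            # The two pieces touched: insert the missing space.
            out += " "
        out += piece
        prev_end = end
    return out
```

For example, `join_tokens_with_spaces("twowords here", [(0, 3), (3, 8), (9, 13)])` gives "two words here". One caveat: if punctuation is a separate piece, this naive version would also put a space before it, so a real helper would need a rule for that.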

nonlocalStream · 2024-04-15T20:29:54Z

+1, I want the same thing. Currently I'm using the PyMuPDF CLI to achieve this: python -m fitz gettext (see https://pymupdf.readthedocs.io/en/latest/recipes-text.html#how-to-extract-text-in-natural-reading-order)

I wish the same thing existed for the generic document OCR (I think the underlying mechanism should be similar: basically reconstructing the layout from the bounding box information, as in https://github.com/pymupdf/PyMuPDF/blob/c0ae13746155e9bb5c11ab7e9a42c2e73758422e/src/__main__.py#L802)

zkalson · 2024-04-18T22:06:59Z

Hey all, I was able to get this mostly working! Here's a rough overview of the process for Python:
- For each page in a document, create a reportlab Canvas object
- Create a text layer on the Canvas object and write the text onto it, using the bounding box data
- Save the PDF and use poppler or pypdf to extract the text layer into a layout-preserved .txt file
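The coordinate step in the second bullet can be sketched roughly as follows. This is not the code from this comment, just an illustration that assumes normalized bounding-box vertices (0.0 to 1.0, origin at the top-left of the page) and a PDF page measured in points with its origin at the bottom-left. The function name `placement_for_box` is hypothetical.

```python
def placement_for_box(vertices, page_width_pt, page_height_pt):
    """Convert a normalized bounding box to a PDF text position.

    vertices: iterable of (x, y) corners, normalized, top-left origin.
    Returns (x, y, height_pt): the bottom-left corner of the box in PDF
    points, plus the box height in points (a rough hint for the font size).
    """
    xs = [v[0] for v in vertices]
    ys = [v[1] for v in vertices]
    left, top, bottom = min(xs), min(ys), max(ys)
    x = left * page_width_pt
    # Flip the vertical axis: PDF y grows upward from the bottom of the page.
    y = (1.0 - bottom) * page_height_pt
    height_pt = (bottom - top) * page_height_pt
    return x, y, height_pt
```

The resulting x and y could then be passed to something like reportlab's `Canvas.drawString(x, y, text)`. Rotated or skewed pages are not handled by this sketch.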

The one issue I'm still stuck on is handling documents when GCP performs preprocessing on them (see my issue here).

If someone is able to help me use the transforms field, I'm happy to invest some time tidying up my code and making a PR with the feature!

Attached is an example input and output.
Input-SampleDocumentAITextLayout.pdf
Output-SampleDocumentAITextLayout.txt

helo-sky · 2024-12-17T23:51:52Z

Any updates on this?
I can't believe Google Cloud OCR doesn't support such a basic output!

meredithslota added the type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. label Sep 7, 2023
