Convert Document AI Object to Preserve Layout Text? #159
Comments
Can you provide more information on what you mean by "preserving the layout of the text"? Do you want all of the text printed to the screen or to a TXT file in the same general locations as in the source document? An example of an input document and the desired output text would be useful. This will likely be difficult to implement, since the layout information extracted by Document AI is expressed as bounding boxes with X, Y coordinates, which don't map cleanly onto TXT files. Document AI by design doesn't fill in the It could be possible to use the |
@holtskinner thank you for your response! image: and the output I am getting is as follows:
How do I get an output string with the same structure as in the image, i.e. as follows:
|
We want to do the same thing here! |
At the very least, ensuring there are spaces between words in the text output from Document AI would be of great assistance. Sometimes, when words are in different entities but next to each other, the Document AI text blob shows them as |
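As a rough illustration of the spacing request above, here is a minimal sketch that joins the words of one line and inserts a space wherever their bounding boxes are horizontally separated. It assumes each word has already been reduced to a plain `(text, x_min, x_max)` tuple in normalized coordinates; `join_words` and `min_gap` are hypothetical names for this example and are not part of Document AI or the toolbox.

```python
def join_words(words, min_gap=0.005):
    """Join (text, x_min, x_max) words from a single line.

    A space is inserted whenever the left edge of a word lies more than
    `min_gap` (normalized units, an assumed threshold) to the right of the
    previous word's right edge. Touching boxes are concatenated directly.
    """
    ordered = sorted(words, key=lambda w: w[1])  # left-to-right by x_min
    parts = []
    prev_right = None
    for text, x_min, x_max in ordered:
        if prev_right is not None and x_min - prev_right > min_gap:
            parts.append(" ")
        parts.append(text)
        prev_right = x_max
    return "".join(parts)


# Hypothetical words, deliberately given out of order.
line = [
    ("Number", 0.25, 0.375),
    ("Invoice", 0.125, 0.234375),
    (":", 0.375, 0.390625),
]
print(join_words(line))
```

The threshold would need tuning per document, since the gap between words depends on font size and scan resolution.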
+1, I want the same thing. Currently I'm using the PyMuPDF CLI to achieve this. I wish the same were available for the generic document OCR (I think the underlying mechanism should be similar: basically reconstructing the layout from the bounding box information, see https://github.com/pymupdf/PyMuPDF/blob/c0ae13746155e9bb5c11ab7e9a42c2e73758422e/src/__main__.py#L802) |
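The idea of reconstructing the layout from bounding box information can be sketched as follows. This is not the PyMuPDF implementation and not a toolbox feature; it is a simplified example that assumes each word is a plain `(text, x, y)` tuple holding the normalized top-left corner of its box, and it places words on a fixed character grid. The names `render_layout`, `cols`, and `rows` are hypothetical.

```python
from collections import defaultdict


def render_layout(words, cols=80, rows=60):
    """Place (text, x, y) words on a character grid and return the text.

    x and y are assumed to be normalized to [0, 1]. Words whose y falls in
    the same grid row are treated as one line; x picks the start column.
    """
    lines = defaultdict(list)
    for text, x, y in words:
        row = min(int(y * rows), rows - 1)
        lines[row].append((x, text))
    if not lines:
        return ""

    out = []
    # Emit every row between the first and last used one, so vertical
    # gaps in the source show up as blank lines.
    for row in range(min(lines), max(lines) + 1):
        buf = ""
        for x, text in sorted(lines.get(row, [])):
            col = min(int(x * cols), cols - 1)
            # If the target column is already occupied, fall back to a
            # single separating space instead of overwriting text.
            if buf and col <= len(buf):
                col = len(buf) + 1
            buf = buf.ljust(col) + text
        out.append(buf)
    return "\n".join(out)


# Hypothetical two-column form.
words = [
    ("Name:", 0.0, 0.25),
    ("Alice", 0.5, 0.25),
    ("Total", 0.0, 0.5),
    ("42", 0.5, 0.5),
]
print(render_layout(words, cols=20, rows=4))
```

A fixed grid is the simplest choice; it quantizes positions, so skewed scans or mixed font sizes would likely need line clustering by box height rather than a uniform row count.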
Hey all, I was able to get this mostly working! Here's a rough overview of the process for Python: The one issue I'm still stuck on is handling documents when GCP performs preprocessing on them; see my issue here. If someone is able to help me use the transforms field, I'm happy to invest some time tidying up my code and making a PR with the feature! Attached is an example input and output. |
Any updates on this? |
Is your feature request related to a problem? Please describe.
I've been using Google Document AI for text extraction from scanned documents, and it's been working well in terms of extracting text. However, I'm facing an issue when it comes to preserving the layout of the text.
In AWS Textract, there's a tool called "pretty print" that helps maintain the layout of extracted text. Tesseract, on the other hand, allows for preserving interword spaces using the
config='-c preserve_interword_spaces=1'
option, which does much the same thing. I really wish "python-documentai-toolbox" could support such output.
Describe the solution you'd like
documentai object => preserved layout text
Describe alternatives you've considered
Extracting text using the
pdftotext
library seemed like a viable option, but surprisingly, "python-documentai-toolbox" doesn't offer support for PDF output, which is rather baffling.