👀 Integrate with Amazon Textextract #6

Open
awtkns opened this issue Nov 13, 2023 · 14 comments
Labels
good first issue Good for newcomers

Comments

@awtkns
Member

awtkns commented Nov 13, 2023

Currently, the only OCR service Tarsier supports is Google Cloud Vision. It would be good to add another OCR service so that Amazon Textract can be used.

@awtkns added the "good first issue" (Good for newcomers) label Nov 13, 2023
@shubhamofbce

I think we need this ASAP, because Google Vision is not working as expected for any complex website. I am working on this.

@awtkns
Member Author

awtkns commented Jan 12, 2024

@shubhamofbce let me know if you need support!

@plamb-viso

Bump; very interested in testing this library out using Textract output.

@asim-shrestha
Contributor

asim-shrestha commented May 16, 2024

@plamb-viso happy to take a PR! It should be fairly straightforward as we have this somewhat abstracted.

We'd also really like to test out Azure OCR, as we've heard it's the most performant. (Will make a separate issue for this.)

@asim-shrestha
Contributor

And any luck, @shubhamofbce?

@shubhamofbce

@asim-shrestha Sorry, I have no update.
I looked into it a long time ago. It was straightforward, but I didn't get a chance to complete it and create a PR, and I no longer have that work with me.

@asim-shrestha
Contributor

No worries, @shubhamofbce. Did you still want to tackle this?

@shubhamofbce

Sorry, but I will not be able to work on it due to time constraints. @asim-shrestha

@Loeing

Loeing commented Jun 20, 2024

I think I should be able to tackle this next week.

@awtkns
Member Author

awtkns commented Jun 28, 2024

Hey @Loeing, let me know if you need any support on this one.

@Loeing

Loeing commented Jun 29, 2024

@awtkns Sorry, this past week has been busier than anticipated. I have been playing around with Tarsier and should be able to make some progress by the end of next week.

@tvatter

tvatter commented Jul 11, 2024

@Loeing I'm super interested in the ability to integrate with Amazon Textract. Have you made any progress on this? Is there any chance I can be of some assistance?

@mscully4

Howdy! I pulled down the code and tried my hand at integrating with AWS Textract. I ran into a small problem: Textract only returns normalized geometry data (values between 0 and 1), which differs from GCP and Azure. This seems to cause an issue with this line of the format_text method, which checks spacing between annotations using 10 pixels as its baseline. Since the data is normalized, everything gets squished onto one line in the output. De-normalizing the data (multiplying the normalized values by the height/width of the image) fixed the issue and produced correct-looking output.

The question I have is: would you rather I just de-normalize the Textract response data, or should the format_text function be updated to operate only on normalized values?
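
For anyone following along, the de-normalization step described above could look roughly like the sketch below. It is not the code from this thread and not Tarsier's API: the function name `denormalize_textract_words` and the shape of the returned dictionaries are hypothetical, chosen only for illustration. The one thing it relies on from Textract is the documented response layout, where each entry in `Blocks` carries a `Geometry.BoundingBox` with `Left`, `Top`, `Width`, and `Height` expressed as ratios of the page size.

```python
from typing import Any, Dict, List


def denormalize_textract_words(
    response: Dict[str, Any], image_width: int, image_height: int
) -> List[Dict[str, Any]]:
    """Convert Textract WORD blocks from 0..1 ratios to pixel coordinates.

    Hypothetical helper: the output dict shape is an assumption made for
    this example, not the annotation type Tarsier actually uses.
    """
    words = []
    for block in response.get("Blocks", []):
        # Textract also returns PAGE and LINE blocks; keep only words.
        if block.get("BlockType") != "WORD":
            continue
        box = block["Geometry"]["BoundingBox"]
        # Horizontal values scale by image width, vertical by image height.
        left = box["Left"] * image_width
        top = box["Top"] * image_height
        width = box["Width"] * image_width
        height = box["Height"] * image_height
        words.append(
            {
                "text": block["Text"],
                "left": left,
                "top": top,
                "width": width,
                "height": height,
                "midpoint": (left + width / 2, top + height / 2),
            }
        )
    return words


# Hand-written sample in the shape of a Textract response (not real output).
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE", "Geometry": {"BoundingBox": {}}},
        {
            "BlockType": "WORD",
            "Text": "Login",
            "Geometry": {
                "BoundingBox": {
                    "Left": 0.25,
                    "Top": 0.5,
                    "Width": 0.125,
                    "Height": 0.0625,
                }
            },
        },
    ]
}

pixel_words = denormalize_textract_words(sample_response, 800, 400)
```

With pixel-space boxes like these, a fixed pixel threshold such as the 10-pixel spacing check mentioned above should behave the same way it does for providers that already return pixel coordinates. The alternative raised in the question, making format_text work on normalized values, would instead require converting that threshold into a ratio of the image size.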

@philipbjorge

Does anyone have any published WIP branches available to look at?
Thanks!

Projects
None yet
Development

No branches or pull requests

8 participants