👀 Integrate with Amazon Textextract #6

awtkns · 2023-11-13T18:31:48Z

Currently, the only OCR service Tarsier supports is Google Cloud Vision. It would be good to add a second OCR service so that Amazon Textract can be used.

shubhamofbce · 2024-01-09T05:42:32Z

I think we need this ASAP, because Google Vision is not working as expected on complex websites. I am working on this.

awtkns · 2024-01-12T06:40:51Z

@shubhamofbce let me know if you need support!

plamb-viso · 2024-05-16T13:38:55Z

Bump; very interested in testing this library out using Textract output.

asim-shrestha · 2024-05-16T17:09:35Z

@plamb-viso happy to take a PR! It should be fairly straightforward as we have this somewhat abstracted.

We'd also really like to test out Azure OCR, as we've heard it's the most performant. (Will make a separate issue for this.)

asim-shrestha · 2024-05-16T17:11:22Z

And any luck @shubhamofbce ?

shubhamofbce · 2024-05-17T08:20:04Z

@asim-shrestha Sorry, I don't have an update.
I looked into it a long time ago. It was straightforward, but I didn't get a chance to complete it and create a PR, and I no longer have that work with me.

asim-shrestha · 2024-05-20T18:14:09Z

No worries @shubhamofbce, did you still want to tackle this?

shubhamofbce · 2024-05-23T16:47:42Z

Sorry, but I will not be able to work on it due to time constraints. @asim-shrestha

Loeing · 2024-06-20T23:54:50Z

I think I should be able to tackle this next week

awtkns · 2024-06-28T17:03:25Z

Hey @Loeing, let me know if you need any support on this one.

Loeing · 2024-06-29T15:28:19Z

@awtkns sorry this past week has been busier than anticipated. Have been playing around with Tarsier. Should be able to make some progress by the end of next week

tvatter · 2024-07-11T07:51:19Z

@Loeing I'm super interested in the ability to integrate with Amazon Textract. Have you made any progress on this? Is there any chance I can be of some assistance?

mscully4 · 2024-07-23T04:34:35Z

Howdy! I pulled down the code and tried my hand at integrating with AWS Textract. I ran into a small problem: Textract only returns normalized geometry data (values between 0 and 1), which differs from GCP and Azure. This seems to cause an issue with a line in the format_text method that checks spacing between annotations using 10 pixels as its baseline. Since the data is normalized, everything gets squished onto one line in the output. De-normalizing the data (multiplying the normalized values by the height/width of the image) fixed the issue and produced correct-looking output. The question I have is: would you rather I just de-normalize the Textract response data, or should the format_text function be updated to operate only on normalized values?
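
For anyone following along, the de-normalization step described above can be sketched roughly as follows. This is a minimal illustration, not code from Tarsier or from any WIP branch. The function name `denormalize_textract_words` and the output dictionary keys (`text`, `left`, `top`, `right`, `bottom`) are hypothetical; only the Textract response fields (`Blocks`, `BlockType`, `Text`, `Geometry.BoundingBox` with `Left`, `Top`, `Width`, `Height`) come from the Textract response format. It assumes the caller already knows the screenshot's pixel dimensions.

```python
from typing import Any, Dict, List


def denormalize_textract_words(
    response: Dict[str, Any], image_width: int, image_height: int
) -> List[Dict[str, Any]]:
    """Convert Textract WORD blocks from 0-1 ratios to pixel coordinates.

    The returned dict layout is a hypothetical stand-in for whatever
    annotation type the OCR abstraction expects.
    """
    words: List[Dict[str, Any]] = []
    for block in response.get("Blocks", []):
        # Textract also returns PAGE and LINE blocks; keep only words here.
        if block.get("BlockType") != "WORD":
            continue
        box = block["Geometry"]["BoundingBox"]
        # Left/Width are ratios of the image width, Top/Height of its height.
        left = box["Left"] * image_width
        top = box["Top"] * image_height
        width = box["Width"] * image_width
        height = box["Height"] * image_height
        words.append(
            {
                "text": block["Text"],
                "left": left,
                "top": top,
                "right": left + width,
                "bottom": top + height,
            }
        )
    return words
```

With pixel coordinates in hand, a fixed pixel threshold like the 10-pixel spacing check should behave the same way it does for the other providers. The alternative raised in the question, making format_text work on normalized values, would instead require scaling that threshold by the image size.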

philipbjorge · 2024-09-20T13:42:30Z

Does anyone have any published WIP branches available to look at?
Thanks

awtkns added the "good first issue" (Good for newcomers) label on Nov 13, 2023
