Error extracting images #16
Comments
I've started exploring the various output classes of the minecart workflow. I can now see that each of Documents, Pages, Images, Letterings, etc. has functions for adjusting bboxes and finding colors, and I will explore these more. That said, I would still like to know how to adjust the sensitivity to words and how to work with various colorspaces, as I have encountered the following error: PDFNotImplementedError: Colorspace 'PDFObjRef:8>' is not supported |
You can use the There is no sensitivity to adjust here -- minecart extracts text as it was created in the PDF, and unfortunately some PDFs break up text into characters or sub-word strings that you have to manually stitch together. In terms of the image colors and the error above, it looks like you are using the |
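The manual stitching mentioned in this reply could be sketched roughly as follows. This is a minimal illustration, not part of minecart: the `(text, x0, y0, x1, y1)` tuple layout, the function name `stitch_fragments`, and the `max_gap` and `line_tol` thresholds are all assumptions, and real documents will likely need tuning per font size.

```python
def stitch_fragments(fragments, max_gap=1.0, line_tol=2.0):
    """Merge text fragments that sit on one line and nearly touch.

    Each fragment is a hypothetical (text, x0, y0, x1, y1) tuple in PDF
    points, with the origin at the bottom-left of the page.
    """
    # Group fragments into lines by comparing their bottom edges (y0).
    lines = []
    for frag in sorted(fragments, key=lambda f: -f[2]):
        if lines and abs(lines[-1][0][2] - frag[2]) <= line_tol:
            lines[-1].append(frag)
        else:
            lines.append([frag])

    words = []
    for line in lines:
        line.sort(key=lambda f: f[1])  # left to right within the line
        current = line[0]
        for text, x0, y0, x1, y1 in line[1:]:
            ctext, cx0, cy0, cx1, cy1 = current
            if x0 - cx1 <= max_gap:
                # Close enough horizontally: extend the current word.
                current = (ctext + text, cx0, min(cy0, y0),
                           max(cx1, x1), max(cy1, y1))
            else:
                words.append(current)
                current = (text, x0, y0, x1, y1)
        words.append(current)
    return words
```

A larger `max_gap` merges whole phrases instead of words, which may be closer to what is needed for catching figure captions.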
Thanks for the advice! It helped me start to navigate the classes within this software. Both of my issues are solved. I thought I'd share my fix for showing images regardless of the PDFObject's Colorspace (specifically if it has 'PDFObjRef:8>').
|
Thanks for sharing! I'll leave this up in case anyone wants to properly improve this part of the library |
Felipe -- I am glad that my tip may be helpful in the long run. If this library issue remains, I may take a crack at improving its implementation. Along those lines, I was wondering if you have explored any other PDF readers or OCR which may be able to better handle some of the other filters, including |
Definitely not tied to pdfminer! The premise of this library is exposing a nice interface to work with pdfs, so if we can preserve the outer API while changing the internals, I don't really care one way or another. (I imagine that may be challenging though!) I have not played with any other readers since I wrote this, and haven't had a need to use OCR. OCR would probably be a nice complement though! |
In my efforts to "simply" extract the images and captions from my library of ~450 PDFs, I started running into a bunch of problems. While I had some success with the files that were born digital, those that were scanned, for example, were not handled well. In fact, scans were just one class of PDFs that weren't handled: 333 of the 450 documents were dropped for one reason or another. I collected and counted all of the reasons as follows: Several of these errors arose from being unable to recognize the various fonts that were embedded in the files. Others came from being unable to handle scanned images. It's probably a pretty big task to update the library for all of these cases. It should be helpful, however, to know the potential errors one may face when attempting this task. |
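For anyone wanting to run a similar survey on their own files, a tally like the one described above can be collected along these lines. This is only a sketch: `tally_failures` and the `process` callable are hypothetical names, and `process` stands in for whatever per-file extraction routine is being tested.

```python
from collections import Counter


def tally_failures(paths, process):
    """Run `process` on every path and count why files were dropped.

    Returns (paths that succeeded, Counter keyed by exception class name).
    """
    succeeded = []
    reasons = Counter()
    for path in paths:
        try:
            process(path)
        except Exception as exc:
            # Deliberately broad: the goal is to survey failure modes,
            # not to recover from them.
            reasons[type(exc).__name__] += 1
        else:
            succeeded.append(path)
    return succeeded, reasons
```

Calling `reasons.most_common()` on the result lists the most frequent failure types first.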
I was able to extract the color using this code snippet
However, this gives RGB colors, which are not the same as the colors that are printed, i.e. DeviceCMYK. I am looking for contributors who can help me address this issue; let me know if anyone is interested. |
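As a stopgap, RGB values can be mapped to CMYK with the naive device formula below. This is an approximation under stated assumptions: it is not an ICC-managed conversion and will generally not reproduce the DeviceCMYK values originally stored in the PDF, which would have to be read from the PDF's own color operands. The function name `rgb_to_cmyk` is illustrative, and inputs are assumed to be floats in the range 0 to 1.

```python
def rgb_to_cmyk(r, g, b):
    """Naive RGB -> CMYK conversion for components in [0, 1].

    Not color-managed; the result only approximates printed colors.
    """
    k = 1.0 - max(r, g, b)
    if k >= 1.0:
        # Pure black: avoid dividing by zero below.
        return (0.0, 0.0, 0.0, 1.0)
    c = (1.0 - r - k) / (1.0 - k)
    m = (1.0 - g - k) / (1.0 - k)
    y = (1.0 - b - k) / (1.0 - k)
    return (c, m, y, k)
```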
Hello,
I am working with a database of PDFs containing research articles in a niche set of academic areas. I am hoping to extract all of the figures and captions. In many instances the default settings can do this, but I have found a few instances where the extracted images are incorrectly colored and/or bold-face figure titles are not being registered as letterings.
One way I could envision working around this is to extract the Images with the found bounding box plus some extra pixel range on the left, right, and bottom. Is there a way to expand the Image class bounding boxes and extract the info in the new bounding boxes?
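The bounding-box workaround described here can be sketched without touching the library internals. The helper below is hypothetical (`expand_bbox` is not a minecart function) and assumes bboxes are `(x0, y0, x1, y1)` tuples in PDF points with the origin at the bottom-left, so "bottom" means a smaller `y0`.

```python
def expand_bbox(bbox, left=0.0, right=0.0, bottom=0.0, top=0.0,
                page_bbox=None):
    """Grow an (x0, y0, x1, y1) bbox by the given margins.

    If `page_bbox` is supplied, the result is clamped to the page.
    """
    x0, y0, x1, y1 = bbox
    x0, y0, x1, y1 = x0 - left, y0 - bottom, x1 + right, y1 + top
    if page_bbox is not None:
        px0, py0, px1, py1 = page_bbox
        x0, y0 = max(x0, px0), max(y0, py0)
        x1, y1 = min(x1, px1), min(y1, py1)
    return (x0, y0, x1, y1)


# Example: widen by 10 pt on each side and reach 50 pt below the image,
# where a caption would typically sit.
caption_region = expand_bbox((100, 200, 300, 400),
                             left=10, right=10, bottom=50)
```

The expanded tuple could then be handed to whatever bbox query the library exposes for letterings on a page; the exact call is an assumption to verify against the minecart README.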
Alternatively, if you could help me understand how to change the settings used in both image and letterings extraction, that would be very helpful.
Best,
mushroom-matthew