Error extracting images #16

Open
mushroom-matthew opened this issue Sep 29, 2018 · 9 comments

Comments

@mushroom-matthew

Hello,
I am working with a database of PDFs containing research articles in a niche set of academic areas. I am hoping to extract all of the figures and captions. In many instances the default settings can do this, but I have found a few instances where the extracted images are incorrectly colored and/or bold-face figure titles are not being registered as letterings.

One way I could envision working around this is to extract the Images with the found bounding box plus some extra pixel range on the left, right, and bottom. Is there a way to expand the Image class bounding boxes and extract the info in the new bounding boxes?

Alternatively, if you could help me understand how to change the settings used in both image and letterings extraction, that would be very helpful.

Best,
mushroom-matthew

@mushroom-matthew
Author

I've started exploring the various output classes of the minecart workflow. I can now see that each of them (Documents, Pages, Images, Letterings, etc.) has functions for adjusting bboxes and finding colors. I will explore these more.

That being said, I would still like to know how to adjust sensitivity to words and how to work with various colorspaces as I have encountered the following error.

PDFNotImplementedError: Colorspace '<PDFObjRef:8>' is not supported

@felipeochoa
Owner

You can use the iter_in_bbox method to inspect all the elements inside a bounding box. So you could, e.g., look at an image, expand the bounding box and then iterate on all letterings inside the bounding box.
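A minimal sketch of that suggestion. It assumes bounding boxes are 4-tuples of the form (x0, y0, x1, y1) in PDF coordinates (origin at the bottom-left, so "below the image" means a smaller y0), and that `image.get_bbox()` returns such a sequence. The helper names `expand_bbox` and `letterings_near` and the default margins are hypothetical, not part of minecart.

```python
def expand_bbox(bbox, left=0.0, right=0.0, bottom=0.0, top=0.0):
    """Grow an (x0, y0, x1, y1) box outward by the given margins."""
    x0, y0, x1, y1 = bbox
    return (x0 - left, y0 - bottom, x1 + right, y1 + top)


def letterings_near(page, image, left=20.0, right=20.0, bottom=60.0):
    """Collect the letterings inside an image's expanded bounding box.

    Assumption: image.get_bbox() yields four numbers (x0, y0, x1, y1).
    The margins are in PDF user-space units, not pixels.
    """
    region = expand_bbox(tuple(image.get_bbox()),
                         left=left, right=right, bottom=bottom)
    return list(page.letterings.iter_in_bbox(region))
```

With a page from `minecart.Document(...).get_page(n)`, something like `letterings_near(page, page.images[0])` should then return the text fragments around the first image, including a caption below it if the bottom margin is large enough.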

There is no sensitivity to adjust here -- minecart extracts text as it was created in the PDF, and unfortunately some PDFs break up text into characters or sub-word strings that you have to manually stitch together.
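For the manual stitching, one possible approach is to group fragments that sit on roughly the same baseline and then join them left to right. The sketch below works on plain (text, bbox) pairs rather than on minecart objects, so the input format, the function name `stitch_fragments`, and both thresholds are assumptions to adapt to your own documents.

```python
def stitch_fragments(fragments, line_tolerance=2.0, gap_threshold=1.0):
    """Join text fragments into lines of text.

    fragments: iterable of (text, (x0, y0, x1, y1)) pairs.
    Fragments whose y0 lies within line_tolerance of the first fragment
    of a line are treated as part of that line. A space is inserted when
    the horizontal gap between neighbours exceeds gap_threshold.
    """
    lines = []  # each entry: [reference_y0, [(text, bbox), ...]]
    # Walk the fragments from the top of the page to the bottom.
    for text, bbox in sorted(fragments, key=lambda f: -f[1][1]):
        if lines and abs(lines[-1][0] - bbox[1]) <= line_tolerance:
            lines[-1][1].append((text, bbox))
        else:
            lines.append([bbox[1], [(text, bbox)]])

    result = []
    for _, items in lines:
        items.sort(key=lambda f: f[1][0])  # left to right
        pieces = [items[0][0]]
        for (_, prev_bbox), (text, bbox) in zip(items, items[1:]):
            if bbox[0] - prev_bbox[2] > gap_threshold:
                pieces.append(" ")
            pieces.append(text)
        result.append("".join(pieces))
    return result
```

This is deliberately simple: it ignores rotated text, multi-column layouts, and superscripts, all of which would need extra handling.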

In terms of the image colors and the error above, it looks like you are using the as_pil method. Unfortunately, that's a relatively unfinished part of the library. You could try using img.obj.get_data() and seeing if that will work for you. Otherwise, you can try commenting out lines 274-368 in content.py to see if that fixes the problem for you.

@felipeochoa felipeochoa added the bug label Oct 1, 2018
@mushroom-matthew
Author

Thanks for the advice! It helped me start to navigate the classes within this software.

Both of my issues are solved. I thought I'd share my fix for showing images regardless of the PDFObject's Colorspace (specifically if it has '<PDFObjRef:8>').

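import io
from PIL import Image

# image is a minecart Image; get_data() returns the bytes of the underlying PDF stream
byte_array = image.obj.get_data()
pil_image = Image.open(io.BytesIO(byte_array))
pil_image.show()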
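A likely reason this works for some images and not others: `Image.open` needs bytes that start with a recognisable file header. A stream that embeds a complete JPEG provides one, while raw pixel data or fax-encoded scans do not, and Pillow then raises an error saying it cannot identify the image file. The sketch below checks the leading bytes before handing them to Pillow. The function name `sniff_image_format` is hypothetical and the signature table is not exhaustive.

```python
# Leading "magic" bytes of a few common image file formats.
_SIGNATURES = [
    (b"\xff\xd8\xff", "JPEG"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"\x00\x00\x00\x0cjP  \r\n\x87\n", "JPEG 2000"),
    (b"II*\x00", "TIFF"),
    (b"MM\x00*", "TIFF"),
    (b"GIF8", "GIF"),
]


def sniff_image_format(data):
    """Return a format name if data starts with a known header, else None."""
    for signature, name in _SIGNATURES:
        if data.startswith(signature):
            return name
    return None
```

If `sniff_image_format(image.obj.get_data())` returns None, the stream is probably not a standalone image file, and it would have to be rebuilt from the image's width, height, and colorspace instead.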

@felipeochoa
Owner

Thanks for sharing! I'll leave this up in case anyone wants to properly improve this part of the library

@felipeochoa felipeochoa changed the title Extracting Bold-face Lettering, Color Issues, Expanding Image Bounding box Error extracting images Oct 2, 2018
@mushroom-matthew
Author

Felipe -- I am glad that my tip may be helpful in the long run. If this library issue remains, I may take a crack at improving its implementation.

Along those lines, I was wondering if you have explored any other PDF readers or OCR which may be able to better handle some of the other filters, including /CCITTFaxDecode. From a dev standpoint, is the use of other packages frowned upon or do outside developers such as myself have freedom to test various package deployments?

@felipeochoa
Owner

Definitely not tied to pdfminer! The premise of this library is exposing a nice interface to work with pdfs, so if we can preserve the outer API while changing the internals, I don't really care one way or another. (I imagine that may be challenging though!)

I have not played with any other readers since I wrote this, and haven't had a need to use OCR. OCR would probably be a nice complement though!

@mushroom-matthew
Author

mushroom-matthew commented Oct 3, 2018

In my efforts to "simply" extract the images and captions from my library of ~450 PDFs, I started running into a bunch of problems. While I had some success with the files that were born digital, those that were scanned, for example, were not handled well. In fact, scanned files were just one class of PDFs that weren't handled: 333 of the 450 documents were dropped for one reason or another. I collected and counted all of the reasons as follows:
"AttributeError: 'PDFGraphicState' object has no attribute 'fill_color'": 15,
"AttributeError: 'PDFGraphicState' object has no attribute 'stroke_color'": 31,
"KeyError: 'Cs6'": 1,
"KeyError: 'DeviceN'": 1,
'OSError: cannot identify image file <_io.BytesIO object at <SOMEKEY>>': 102,
"TypeError: a bytes-like object is required, not 'str'": 4,
"pdfminer.pdffont.PDFUnicodeNotDefined: ('Adobe-Identity', 11)": 1,
"pdfminer.pdffont.PDFUnicodeNotDefined: ('Adobe-Identity', 144)": 1,
"pdfminer.pdffont.PDFUnicodeNotDefined: ('Adobe-Identity', 1849)": 1,
"pdfminer.pdffont.PDFUnicodeNotDefined: ('Adobe-Identity', 59952)": 1,
'pdfminer.pdffont.PDFUnicodeNotDefined: (None, 1)': 3,
'pdfminer.pdffont.PDFUnicodeNotDefined: (None, 129)': 7,
'pdfminer.pdffont.PDFUnicodeNotDefined: (None, 13)': 1,
'pdfminer.pdffont.PDFUnicodeNotDefined: (None, 132)': 3,
'pdfminer.pdffont.PDFUnicodeNotDefined: (None, 160)': 6,
'pdfminer.pdffont.PDFUnicodeNotDefined: (None, 173)': 1,
'pdfminer.pdffont.PDFUnicodeNotDefined: (None, 176)': 4,
'pdfminer.pdffont.PDFUnicodeNotDefined: (None, 2)': 11,
'pdfminer.pdffont.PDFUnicodeNotDefined: (None, 211)': 1,
'pdfminer.pdffont.PDFUnicodeNotDefined: (None, 213)': 1,
'pdfminer.pdffont.PDFUnicodeNotDefined: (None, 223)': 1,
'pdfminer.pdffont.PDFUnicodeNotDefined: (None, 24)': 1,
'pdfminer.pdffont.PDFUnicodeNotDefined: (None, 25)': 1,
'pdfminer.pdffont.PDFUnicodeNotDefined: (None, 26)': 1,
'pdfminer.pdffont.PDFUnicodeNotDefined: (None, 3)': 3,
'pdfminer.pdffont.PDFUnicodeNotDefined: (None, 30)': 1,
'pdfminer.pdffont.PDFUnicodeNotDefined: (None, 31)': 1,
'pdfminer.pdffont.PDFUnicodeNotDefined: (None, 4)': 2,
'pdfminer.pdffont.PDFUnicodeNotDefined: (None, 63)': 1,
'pdfminer.pdffont.PDFUnicodeNotDefined: (None, 8)': 1,
'pdfminer.pdffont.PDFUnicodeNotDefined: (None, 88)': 1,
'pdfminer.pdftypes.PDFNotImplementedError: Unsupported filter: /CCITTFaxDecode': 113,
'pdfminer.pdftypes.PDFNotImplementedError: Unsupported filter: /JBIG2Decode': 5,
'pdfminer.pdftypes.PDFNotImplementedError: Unsupported filter: /JPXDecode': 1,
'pdfminer.pdftypes.PDFNotImplementedError: Unsupported predictor: 2': 3

Several of these errors arose from being unable to recognize the various fonts that were embedded in the files. Others weren't able to handle scanned images. It's probably a pretty big task to update the library for all of these cases. It should be helpful, however, to know the potential errors one may face when attempting this task.
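For anyone who wants to produce a similar tally for their own collection, here is a minimal sketch using `collections.Counter`. The `process` callable stands in for whatever per-file extraction you run (for example, opening the file with minecart and pulling out the images). Both it and the function name `tally_failures` are hypothetical.

```python
from collections import Counter


def tally_failures(paths, process):
    """Run process(path) on every path and count failures by message.

    Returns (succeeded_paths, failures), where failures is a Counter
    keyed by "ExceptionName: message".
    """
    failures = Counter()
    succeeded = []
    for path in paths:
        try:
            process(path)
        except Exception as exc:  # broad on purpose: we want every reason
            failures["{}: {}".format(type(exc).__name__, exc)] += 1
        else:
            succeeded.append(path)
    return succeeded, failures
```

Messages that embed memory addresses or other per-run values would need normalising before counting, otherwise each occurrence ends up under its own key.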

@Abhiroyq1

For the error "'PDFGraphicState' object has no attribute 'fill_color'":

I changed some code in the pdfinterp.py file of the pdfminer module. In class PDFGraphicState, I added two lines to the initialisation as well as to the "copy" method.

[Screenshot: pdfinterp]

@Akash91

Akash91 commented May 21, 2019

I was able to extract the colors using this code snippet:

import minecart

colors = set()

with open("{pathtoyourPDFhere}.pdf", "rb") as file:
    document = minecart.Document(file)
    page = document.get_page(0)
    for shape in page.shapes:
        # Skip shapes that have no fill
        if shape.fill:
            colors.add(shape.fill.color.as_rgb())

for color in colors:
    print(color)

However, this gives RGB colors, which are not the same as the colors that are printed, i.e. DeviceCMYK.
I have tried Ghostscript, ImageMagick and other libraries, but all of them provide a class which does not have a to_cmyk() method.

I am looking for contributors who can help me address this issue, let me know if anyone is interested
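As a stopgap, the textbook RGB-to-CMYK formula can be applied to the values collected above. It is only a rough approximation: it is not color-managed and it does not recover the original DeviceCMYK numbers stored in the PDF. The sketch assumes the RGB components are floats in the range 0 to 1. The function name `naive_rgb_to_cmyk` is hypothetical.

```python
def naive_rgb_to_cmyk(rgb):
    """Convert an (r, g, b) tuple with components in [0, 1] to (c, m, y, k).

    Uses the simple device-independent formula; no ICC profile involved.
    """
    r, g, b = rgb
    k = 1.0 - max(r, g, b)
    if k >= 1.0:
        # Pure black: avoid dividing by zero below.
        return (0.0, 0.0, 0.0, 1.0)
    c = (1.0 - r - k) / (1.0 - k)
    m = (1.0 - g - k) / (1.0 - k)
    y = (1.0 - b - k) / (1.0 - k)
    return (c, m, y, k)
```

Getting the exact printed values would instead mean reading the color operands before they are converted to RGB, which is the part that still needs a contributor.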
