Error extracting images #16

mushroom-matthew · 2018-09-29T16:10:58Z

Hello,
I am working with a database of PDFs containing research articles in a niche set of academic areas. I am hoping to extract all of the figures and captions. In many instances the default settings can do this, but I have found a few instances where the extracted images are incorrectly colored and/or bold-face figure titles are not being registered as letterings.

One way I could envision working around this is to extract the Images with the found bounding box plus some extra pixel range on the left, right, and bottom. Is there a way to expand the Image class bounding boxes and extract the info in the new bounding boxes?

Alternatively, if you could help me understand how to change the settings used in both image and letterings extraction, that would be very helpful.

Best,
mushroom-matthew

mushroom-matthew · 2018-09-29T16:28:02Z

I've started exploring the various output classes of the minecart workflow. I can now see that each of them (Documents, Pages, Images, Letterings, etc.) has functions for adjusting bboxes and finding colors. I will explore these more.

That being said, I would still like to know how to adjust sensitivity to words and how to work with various colorspaces as I have encountered the following error.

PDFNotImplementedError: Colorspace 'PDFObjRef:8>' is not supported

felipeochoa · 2018-10-01T13:53:05Z

You can use the iter_in_bbox method to inspect all the elements inside a bounding box. So you could, e.g., look at an image, expand the bounding box and then iterate on all letterings inside the bounding box.
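The bounding-box expansion described above might be sketched as follows. This is only an illustration: `expand_bbox` is a hypothetical helper, and the `(x0, y0, x1, y1)` ordering with the origin at the bottom-left of the page, as well as the `get_bbox()` call in the comment, are assumptions to check against your minecart version.

```python
def expand_bbox(bbox, left=0, right=0, bottom=0, top=0):
    """Return a copy of bbox grown outward by the given margins.

    bbox is assumed to be (x0, y0, x1, y1) in PDF units with the origin at
    the bottom-left of the page, so "bottom" lowers y0 and "top" raises y1.
    """
    x0, y0, x1, y1 = bbox
    return (x0 - left, y0 - bottom, x1 + right, y1 + top)


# Hypothetical use with minecart (not executed here; get_bbox() is assumed):
#   bbox = expand_bbox(image.get_bbox(), left=10, right=10, bottom=40)
#   caption_parts = list(page.letterings.iter_in_bbox(bbox))
```

A generous bottom margin is the natural choice when captions sit below their figures, since in bottom-left coordinates "below" means a smaller y value.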

There is no sensitivity to adjust here -- minecart extracts text as it was created in the PDF, and unfortunately some PDFs break up text into characters or sub-word strings that you have to manually stitch together.
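Stitching fragments back together could look roughly like this. It is a minimal sketch, assuming each lettering has already been reduced to a hypothetical `(text, (x0, y0, x1, y1))` tuple in PDF coordinates (origin at the bottom-left); the function name and both tolerances are illustrative and not part of minecart.

```python
def stitch_fragments(fragments, line_tol=2.0, gap_tol=1.0):
    """Join (text, (x0, y0, x1, y1)) fragments into lines of text."""
    # Group fragments into lines, working from the top of the page down.
    lines = []  # each entry: [reference_y, [fragments on that line]]
    for frag in sorted(fragments, key=lambda f: -f[1][1]):
        y0 = frag[1][1]
        if lines and abs(lines[-1][0] - y0) <= line_tol:
            lines[-1][1].append(frag)
        else:
            lines.append([y0, [frag]])

    result = []
    for _, frags in lines:
        # Within a line, read left to right and insert a space only where
        # the horizontal gap between neighbours exceeds gap_tol.
        frags.sort(key=lambda f: f[1][0])
        text, prev_x1 = "", None
        for piece, (x0, _y0, x1, _y1) in frags:
            if prev_x1 is not None and x0 - prev_x1 > gap_tol:
                text += " "
            text += piece
            prev_x1 = x1
        result.append(text)
    return result
```

The tolerances depend on font size, so they would likely need tuning per document.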

In terms of the image colors and the error above, it looks like you are using the as_pil method. Unfortunately, that's a relatively unfinished part of the library. You could try using img.obj.get_data() and seeing if that will work for you. Otherwise, you can try commenting out lines 274-368 in content.py to see if that fixes the problem for you.

mushroom-matthew · 2018-10-02T02:48:22Z

Thanks for the advice! It helped me start to navigate the classes within this software.

Both of my issues are solved. I thought I'd share my fix for showing images regardless of the PDFObject's Colorspace (specifically if it has 'PDFObjRef:8>').

import io
from PIL import Image

# Read the raw stream bytes directly instead of going through as_pil()
byte_array = image.obj.get_data()
pil_image = Image.open(io.BytesIO(byte_array))
pil_image.show()
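Pillow can only open these bytes when they already form a complete image file (a JPEG, for example) rather than raw pixel samples. A quick way to check before opening is to look at the leading bytes. The sketch below is an illustration only: `sniff_image_format` is a hypothetical helper, and the signature list covers just a few common formats.

```python
def sniff_image_format(data):
    """Guess a container format from the leading bytes, or None if unknown."""
    signatures = [
        (b"\xff\xd8\xff", "jpeg"),
        (b"\x89PNG\r\n\x1a\n", "png"),
        (b"\x00\x00\x00\x0cjP  \r\n\x87\n", "jpeg2000"),
        (b"GIF87a", "gif"),
        (b"GIF89a", "gif"),
        (b"II*\x00", "tiff"),
        (b"MM\x00*", "tiff"),
    ]
    for magic, name in signatures:
        if data.startswith(magic):
            return name
    # Probably raw samples or an encoding that is not a standalone image file.
    return None
```

When this returns None, the stream likely needs its width, height, and colorspace from the PDF object before it can be turned into an image.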

felipeochoa · 2018-10-02T02:50:54Z

Thanks for sharing! I'll leave this up in case anyone wants to properly improve this part of the library

mushroom-matthew · 2018-10-02T15:16:47Z

Felipe -- I am glad that my tip may be helpful in the long run. If this library issue remains, I may take a crack at improving its implementation.

Along those lines, I was wondering if you have explored any other PDF readers or OCR which may be able to better handle some of the other filters, including /CCITTFaxDecode. From a dev standpoint, is the use of other packages frowned upon or do outside developers such as myself have freedom to test various package deployments?

felipeochoa · 2018-10-02T15:20:22Z

Definitely not tied to pdfminer! The premise of this library is exposing a nice interface to work with pdfs, so if we can preserve the outer API while changing the internals, I don't really care one way or another. (I imagine that may be challenging though!)

I have not played with any other readers since I wrote this, and haven't had a need to use OCR. OCR would probably be a nice complement though!

mushroom-matthew · 2018-10-03T01:55:23Z

In my efforts to "simply" extract the images and captions from my library of ~450 PDFs, I started running into a bunch of problems. While I have had some success with the files that were born digital, those that were scanned, for example, were not handled well. In fact, that was just one class of PDFs that weren't handled: 333 of the 450 documents were dropped for one reason or another. I collected and counted all of the reasons as follows:
"AttributeError: 'PDFGraphicState' object has no attribute 'fill_color'": 15,
"AttributeError: 'PDFGraphicState' object has no attribute 'stroke_color'": 31,
"KeyError: 'Cs6'": 1,
"KeyError: 'DeviceN'": 1,
'OSError: cannot identify image file <_io.BytesIO object at <SOMEKEY>>': 102,
"TypeError: a bytes-like object is required, not 'str'": 4,
"pdfminer.pdffont.PDFUnicodeNotDefined: ('Adobe-Identity', 11)": 1,
"pdfminer.pdffont.PDFUnicodeNotDefined: ('Adobe-Identity', 144)": 1,
"pdfminer.pdffont.PDFUnicodeNotDefined: ('Adobe-Identity', 1849)": 1,
"pdfminer.pdffont.PDFUnicodeNotDefined: ('Adobe-Identity', 59952)": 1,
'pdfminer.pdffont.PDFUnicodeNotDefined: (None, 1)': 3,
'pdfminer.pdffont.PDFUnicodeNotDefined: (None, 129)': 7,
'pdfminer.pdffont.PDFUnicodeNotDefined: (None, 13)': 1,
'pdfminer.pdffont.PDFUnicodeNotDefined: (None, 132)': 3,
'pdfminer.pdffont.PDFUnicodeNotDefined: (None, 160)': 6,
'pdfminer.pdffont.PDFUnicodeNotDefined: (None, 173)': 1,
'pdfminer.pdffont.PDFUnicodeNotDefined: (None, 176)': 4,
'pdfminer.pdffont.PDFUnicodeNotDefined: (None, 2)': 11,
'pdfminer.pdffont.PDFUnicodeNotDefined: (None, 211)': 1,
'pdfminer.pdffont.PDFUnicodeNotDefined: (None, 213)': 1,
'pdfminer.pdffont.PDFUnicodeNotDefined: (None, 223)': 1,
'pdfminer.pdffont.PDFUnicodeNotDefined: (None, 24)': 1,
'pdfminer.pdffont.PDFUnicodeNotDefined: (None, 25)': 1,
'pdfminer.pdffont.PDFUnicodeNotDefined: (None, 26)': 1,
'pdfminer.pdffont.PDFUnicodeNotDefined: (None, 3)': 3,
'pdfminer.pdffont.PDFUnicodeNotDefined: (None, 30)': 1,
'pdfminer.pdffont.PDFUnicodeNotDefined: (None, 31)': 1,
'pdfminer.pdffont.PDFUnicodeNotDefined: (None, 4)': 2,
'pdfminer.pdffont.PDFUnicodeNotDefined: (None, 63)': 1,
'pdfminer.pdffont.PDFUnicodeNotDefined: (None, 8)': 1,
'pdfminer.pdffont.PDFUnicodeNotDefined: (None, 88)': 1,
'pdfminer.pdftypes.PDFNotImplementedError: Unsupported filter: /CCITTFaxDecode': 113,
'pdfminer.pdftypes.PDFNotImplementedError: Unsupported filter: /JBIG2Decode': 5,
'pdfminer.pdftypes.PDFNotImplementedError: Unsupported filter: /JPXDecode': 1,
'pdfminer.pdftypes.PDFNotImplementedError: Unsupported predictor: 2': 3

Several of these errors arose because the fonts embedded in the files could not be recognized. Others came from scanned images that could not be decoded. It's probably a pretty big task to update the library for all of these cases. It should be helpful, however, to know the potential errors one may face when attempting this task.
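A tally like the one above can be collected with a small wrapper around whatever per-file extraction routine is in use. The following is a sketch under stated assumptions: `process` is a hypothetical callable that takes a path and raises on failure, and the `<SOMEKEY>` masking simply replaces hexadecimal memory addresses so that otherwise identical messages are counted together.

```python
import re
from collections import Counter


def normalise_error(exc):
    """Format an exception as 'Type: message' with memory addresses masked."""
    text = "{}: {}".format(type(exc).__name__, exc)
    return re.sub(r"0x[0-9A-Fa-f]+", "<SOMEKEY>", text)


def tally_failures(paths, process):
    """Run process(path) on every path and count failures by reason."""
    reasons = Counter()
    for path in paths:
        try:
            process(path)
        except Exception as exc:  # broad on purpose: record every failure mode
            reasons[normalise_error(exc)] += 1
    return reasons
```

Calling `tally_failures(pdf_paths, extract_figures).most_common()` (with `extract_figures` standing in for your own routine) lists the most frequent failure reasons first.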

Abhiroyq1 · 2018-11-20T08:51:56Z

For the "'PDFGraphicState' object has no attribute 'fill_color'" error:

I changed some code in the pdfinterp.py file of the pdfminer module. In the PDFGraphicState class, I added two lines to the initialisation as well as to "copy".
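The exact lines are not shown, so the following is only a guess at the shape of that change, written against a stand-in class rather than pdfminer's real one. The class name, the `linewidth` placeholder, and the `None` defaults are all assumptions.

```python
class GraphicStateSketch:
    """Stand-in for a graphics-state class; this is not pdfminer's code."""

    def __init__(self):
        self.linewidth = 0  # placeholder for the existing attributes
        # Two added lines: give the color attributes a default so that
        # reading them never raises AttributeError.
        self.fill_color = None
        self.stroke_color = None

    def copy(self):
        obj = GraphicStateSketch()
        obj.linewidth = self.linewidth
        # The matching two lines, so copies carry the colors along.
        obj.fill_color = self.fill_color
        obj.stroke_color = self.stroke_color
        return obj
```

Editing an installed package in place is fragile, so a change like this would be lost on the next reinstall unless it is kept in a fork.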

Akash91 · 2019-05-21T08:50:05Z

I was able to extract the colors using this code snippet:

import minecart

colors = set()

with open("{pathtoyourPDFhere}.pdf", "rb") as file:
    document = minecart.Document(file)
    page = document.get_page(0)
    for shape in page.shapes:
        if shape.fill:
            colors.add(shape.fill.color.as_rgb())

for color in colors:
    print(color)

But this gives us RGB colors, which are not the same as the colors that are printed, i.e. DeviceCMYK.
I have tried Ghostscript, ImageMagick, and other libraries, but all of them provide a class that does not have a to_cmyk() method.
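For reference, the textbook RGB-to-CMYK formula is easy to write by hand, though it is not color managed and will generally not reproduce the DeviceCMYK values originally stored in the PDF. A minimal sketch, assuming as_rgb() yields three floats in the range 0 to 1 (the function name `rgb_to_cmyk` is hypothetical):

```python
def rgb_to_cmyk(rgb):
    """Naively convert an (r, g, b) tuple of floats in 0..1 to (c, m, y, k)."""
    r, g, b = rgb
    k = 1.0 - max(r, g, b)
    if k >= 1.0:  # pure black: avoid dividing by zero below
        return (0.0, 0.0, 0.0, 1.0)
    scale = 1.0 - k
    return (
        (1.0 - r - k) / scale,
        (1.0 - g - k) / scale,
        (1.0 - b - k) / scale,
        k,
    )
```

Recovering the exact printed values would instead require reading the original color operands from the PDF before any conversion to RGB takes place.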

I am looking for contributors who can help me address this issue. Let me know if anyone is interested.

felipeochoa added the bug label Oct 1, 2018

felipeochoa added the help wanted label Oct 2, 2018

felipeochoa changed the title ~~Extracting Bold-face Lettering, Color Issues, Expanding Image Bounding box~~ Error extracting images Oct 2, 2018

eplebel mentioned this issue Aug 28, 2019

curation & display-based (low-priority) improvements ScienceCommons/curate_science#66

Open

90 tasks

eplebel mentioned this issue Sep 30, 2019

extract figures directly from PDF functionality ScienceCommons/curate_science#73

Open

vgopinath mentioned this issue Nov 22, 2019

PDFNotImplementedError: Colorspace 'PDFObjRef:100>' is not supported #23

Open
