@Jim-Encord
Contributor

Feature, Tests, Docs not yet.

This currently adds a dependency. It could technically become an optional dependency.

@Jim-Encord Jim-Encord requested a review from a team as a code owner April 7, 2025 13:08
@github-actions

github-actions bot commented Apr 7, 2025

Encord Agents test report

90 tests: 89 passed ✅, 1 skipped 💤, 0 failed ❌
1 suite, 1 file, 3m 18s ⏱️

Results for commit 17d4bb9.

♻️ This comment has been updated with latest results.

import pymupdf


def extract_page(
Contributor Author

This can't really be done like the other method, because doc.close() apparently invalidates pix, so it's best to write it to a file immediately.

Contributor

I think we should fix this straight away. pymupdf.Pixmap has a pil_image method, so we can make the optional dependencies [pymupdf, Pillow] and use that. Then there is no need to store anything on disk.
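
A minimal sketch of the in-memory approach suggested above, under a few assumptions: `render_page_to_image` is a hypothetical name (it is not the PR's `extract_page`, whose signature is not shown here), and instead of `pil_image` it uses the `Image.frombytes` recipe on `pix.samples`, which copies the pixel data so the image stays usable after `doc.close()`.

```python
def render_page_to_image(pdf_path: str, page_number: int = 0, dpi: int = 150):
    """Render one PDF page to an in-memory PIL image (hypothetical helper).

    No temporary file is written; the returned image owns its pixel data.
    """
    # Imported lazily so this module still imports when the optional
    # PDF dependencies are not installed.
    import pymupdf
    from PIL import Image

    doc = pymupdf.open(pdf_path)
    try:
        page = doc.load_page(page_number)
        # alpha=False with the default colorspace gives plain RGB samples.
        pix = page.get_pixmap(dpi=dpi, alpha=False)
        # frombytes copies the buffer, so the result does not depend on
        # the document staying open.
        return Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
    finally:
        doc.close()
```

Whether this replaces the file-based path depends on how the crops are consumed downstream; the sketch only shows that the pixel data can outlive the document.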

@Jim-Encord
Contributor Author

At a high level, this would be better if we didn't write the intermediate crops to a file. As is, I'm just emulating the behaviour for videos, but this could certainly be improved. A ticket has already been made around this: Serverlessness

@Jim-Encord Jim-Encord force-pushed the jb/object-crops-on-pdfs branch from c5f658e to e472680 Compare April 8, 2025 09:01
pyproject.toml Outdated
numpy = ">=1.26.4"
opencv-python = ">=4.1"

PyMuPdf = ">1.25.0"
Contributor

This is a big no-no IMO. I agree with your comment that this should be made optional.
So you do pip install "encord-agents[pdf]" if you need it. `encord-agents` has to be lightweight in terms of dependencies, because what happens when we need to support DICOM or some exotic geospatial data type?
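
One way the optional-extra idea could look on the Python side, as a rough sketch: `import_optional` is a hypothetical helper (not part of `encord-agents`), and the `pdf` extra name is taken from the install command above. It turns a bare `ImportError` into a message that tells the user which extra to install.

```python
import importlib
from types import ModuleType


def import_optional(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency or fail with an install hint.

    Hypothetical helper: `extra` is the name of the optional-dependency
    group that provides `module_name`.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"'{module_name}' is needed for this feature. "
            f'Install it with: pip install "encord-agents[{extra}]"'
        ) from exc
```

A PDF helper would then call `import_optional("pymupdf", "pdf")` at the point of use, so the core package stays importable without the extra installed.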

Adds support for the PDF package and provides a method for interfacing with PDFs.
Makes this package optional and slightly modifies CI to ensure testing with PDF support
3 participants