Commit 266ee82

Merge pull request #242 from pymupdf/Version-0.0.20
Changes Version 0.0.20
2 parents 8460a9f + 05becd8 commit 266ee82

6 files changed: 113 additions, 69 deletions

CHANGES.md

Lines changed: 18 additions & 0 deletions
@@ -1,5 +1,23 @@
 # Change Log

+## Changes in version 0.0.20
+
+### Fixes:
+
+* [171](https://github.com/pymupdf/RAG/issues/171) - Text rects overlap with tables and images that should be excluded.
+* [189](https://github.com/pymupdf/RAG/issues/189) - The position of the extracted image is incorrect
+* [238](https://github.com/pymupdf/RAG/issues/238) - When text is laid out around the picture, text extraction is missing.
+
+### Other Changes:
+
+* Added **_new parameter_** `ignore_images`: (bool), optional. `True` will not consider images in any way. May be useful for pages where a plethora of images prevents meaningful layout analysis. Typical examples are PowerPoint slides and derived / similar pages.
+
+* Added **_new parameter_** `ignore_graphics`: (bool), optional. `True` will not consider graphics except for table detection. May be useful for pages where a plethora of vector graphics prevents meaningful layout analysis. Typical examples are PowerPoint slides and derived / similar pages.
+
+* Added **_new parameter_** to class `IdentifyHeaders`: Use `max_levels` (integer <= 6) to limit the generation of header tag levels. For example, `headers = pymupdf4llm.IdentifyHeaders(doc, max_levels=3)` ensures that at most 3 header levels will ever be generated. Any text with a font size less than that of `###` will be body text. In this case, the markdown generation itself would be coded as `md = pymupdf4llm.to_markdown(doc, hdr_info=headers, ...)`.
+
+* Changed parameter `table_strategy`: When `None` is specified, no attempt is made to detect tables. This can be useful when tables are of no interest or are known not to exist in a given file, and it speeds up processing significantly. Be prepared to see more changes and extensions here.
+

 ## Changes in version 0.0.19

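As a rough illustration of how the options in this change log entry might be combined, the sketch below assembles keyword arguments for `pymupdf4llm.to_markdown`. The helper `markdown_options` and its two flags are hypothetical names introduced here; only `ignore_images`, `ignore_graphics`, `table_strategy`, `max_levels` and `hdr_info` come from the entry itself. The actual conversion is left in comments because it needs an input file (`"input.pdf"` is a placeholder).

```python
def markdown_options(slide_like: bool = False, want_tables: bool = True) -> dict:
    """Hypothetical helper: assemble keyword arguments for to_markdown()."""
    options = {}
    if slide_like:
        # Pages crowded with images / vector graphics (e.g. PowerPoint exports):
        # skip both so they cannot disturb the layout analysis.
        options["ignore_images"] = True
        options["ignore_graphics"] = True
    if not want_tables:
        # None switches table detection off entirely (faster processing).
        options["table_strategy"] = None
    return options


opts = markdown_options(slide_like=True, want_tables=False)
print(opts)

# Assumed usage with an installed pymupdf4llm >= 0.0.20:
#   import pymupdf
#   import pymupdf4llm
#   doc = pymupdf.open("input.pdf")  # placeholder file name
#   headers = pymupdf4llm.IdentifyHeaders(doc, max_levels=3)
#   md = pymupdf4llm.to_markdown(doc, hdr_info=headers, **opts)
```

When neither flag is set the helper returns an empty dict, so the library defaults apply unchanged.
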
pdf4llm/setup.py

Lines changed: 2 additions & 2 deletions
@@ -13,11 +13,11 @@
     "Programming Language :: Python :: 3",
     "Topic :: Utilities",
 ]
-requires = ["pymupdf4llm>=0.0.19"]
+requires = ["pymupdf4llm==0.0.20"]

 setuptools.setup(
     name="pdf4llm",
-    version="0.0.19",
+    version="0.0.20",
     author="Artifex",
     author_email="[email protected]",
     description="PyMuPDF Utilities for LLM/RAG",

pymupdf4llm/pymupdf4llm/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 from .helpers.pymupdf_rag import IdentifyHeaders, to_markdown

-__version__ = "0.0.19"
+__version__ = "0.0.20"
 version = __version__
 version_tuple = tuple(map(int, version.split(".")))

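The `version_tuple` expression in this file lends itself to numeric version comparisons. The sketch below reuses the same expression in a standalone function (`parse_version` is a hypothetical name, not part of the package) to show why a tuple of integers compares more reliably than the raw string.

```python
def parse_version(version: str) -> tuple:
    """Same expression as in __init__.py: "0.0.20" becomes (0, 0, 20)."""
    return tuple(map(int, version.split(".")))


new = parse_version("0.0.20")
old = parse_version("0.0.19")

# Tuples compare element by element, numerically.
print(new > old)

# Plain strings compare character by character, so "9" sorts after "2":
print("0.0.9" > "0.0.20")  # True as strings, although 0.0.9 is the older release
print(parse_version("0.0.9") < new)
```

A caller could use the same idea to guard the new parameters, for example `if pymupdf4llm.version_tuple >= (0, 0, 20): ...`. Note that the expression only works for purely numeric version strings; a suffix such as `rc1` would make `int()` raise `ValueError`.
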
pymupdf4llm/pymupdf4llm/helpers/multi_column.py

Lines changed: 16 additions & 9 deletions
@@ -76,6 +76,7 @@ def column_boxes(
     textpage=None,
     paths=None,
     avoid=None,
+    ignore_images=False,
 ):
     """Determine bboxes which wrap a column on the page.

@@ -261,7 +262,9 @@ def join_rects_phase3(bboxes, path_rects, cache):
                         continue

                     # do not join different backgrounds
-                    if in_bbox_using_cache(prect0, path_rects, cache) != in_bbox_using_cache(prect1, path_rects, cache):
+                    if in_bbox_using_cache(
+                        prect0, path_rects, cache
+                    ) != in_bbox_using_cache(prect1, path_rects, cache):
                         continue
                     temp = prect0 | prect1
                     test = set(
@@ -333,11 +336,12 @@ def join_rects_phase3(bboxes, path_rects, cache):
     clip.y1 -= footer_margin  # Remove footer area
     clip.y0 += header_margin  # Remove header area

-    paths = [
-        p
-        for p in page.get_drawings()
-        if p["rect"].width < clip.width and p["rect"].height < clip.height
-    ]
+    if paths is None:
+        paths = [
+            p
+            for p in page.get_drawings()
+            if p["rect"].width < clip.width and p["rect"].height < clip.height
+        ]

     if textpage is None:
         textpage = page.get_textpage(clip=clip, flags=pymupdf.TEXTFLAGS_TEXT)
@@ -371,8 +375,9 @@ def join_rects_phase3(bboxes, path_rects, cache):
     path_rects.sort(key=lambda b: (b.y0, b.x0))

     # bboxes of images on page, no need to sort them
-    for item in page.get_images():
-        img_bboxes.extend(page.get_image_rects(item[0]))
+    if ignore_images is False:
+        for item in page.get_images():
+            img_bboxes.extend(page.get_image_rects(item[0]))

     # blocks of text on page
     blocks = textpage.extractDICT()["blocks"]
@@ -433,7 +438,9 @@ def join_rects_phase3(bboxes, path_rects, cache):
                 continue

             # never join across different background colors
-            if in_bbox_using_cache(nbb, path_rects, cache) != in_bbox_using_cache(
+            if in_bbox_using_cache(nbb, path_rects, cache) != in_bbox_using_cache(
+                bb, path_rects, cache
+            ):
                 continue

             temp = bb | nbb  # temporary extension of new block

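Two of the hunks above change behavior rather than formatting: `paths` is now only computed when the caller did not supply it, and image bboxes are only collected when `ignore_images` is `False`. The sketch below mimics both guards with plain Python stand-ins so it runs without PyMuPDF. `Rect`, `select_paths` and `collect_image_bboxes` are hypothetical names for illustration, and passing an empty list to suppress vector graphics is an assumption about how a caller might use the new guard.

```python
from dataclasses import dataclass


@dataclass
class Rect:
    """Stand-in for a rectangle; only width and height matter here."""
    width: float
    height: float


def select_paths(drawings, clip, paths=None):
    """Honor caller-supplied paths; otherwise filter the page drawings."""
    if paths is None:
        # Keep only drawings smaller than the clip in both dimensions.
        paths = [
            p
            for p in drawings
            if p["rect"].width < clip.width and p["rect"].height < clip.height
        ]
    return paths


def collect_image_bboxes(image_rects, ignore_images=False):
    """Collect image bboxes unless the caller asked to ignore images."""
    img_bboxes = []
    if ignore_images is False:
        img_bboxes.extend(image_rects)
    return img_bboxes


clip = Rect(500, 700)
drawings = [{"rect": Rect(100, 50)}, {"rect": Rect(500, 700)}]

print(len(select_paths(drawings, clip)))            # page-sized drawing is dropped
print(len(select_paths(drawings, clip, paths=[])))  # caller's empty list is kept
print(collect_image_bboxes([Rect(10, 10)], ignore_images=True))
```

Before this change the list comprehension ran unconditionally, so a value passed in `paths` would have been overwritten.
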