Commit 2ec62ae

Addresses multiple issues
For details see CHANGES.md in the repository.
1 parent 3ad7edf commit 2ec62ae

5 files changed: +100 -20 lines changed
CHANGES.md

Lines changed: 44 additions & 0 deletions
@@ -1,6 +1,50 @@
 # Change Log


+## Changes in version 0.0.19
+
+### Fixes:
+The following list includes fixes made in version 0.0.18 already.
+
+* [158](https://github.com/pymupdf/RAG/issues/158) - Very long titles when converting to markdown.
+* [155](https://github.com/pymupdf/RAG/issues/155) - Inconsistent image extraction from image-only PDFs
+* [161](https://github.com/pymupdf/RAG/issues/161) - force_text param ignored.
+* [162](https://github.com/pymupdf/RAG/issues/162) - to_markdown isn't outputting all the pages but get_text is.
+* [173](https://github.com/pymupdf/RAG/issues/173) - First column of table is repeated before the actual table.
+* [187](https://github.com/pymupdf/RAG/issues/187) - Unsolicited Text Particles
+* [188](https://github.com/pymupdf/RAG/issues/188) - Takes lot of time to convert into markdown.
+* [191](https://github.com/pymupdf/RAG/issues/191) - Extraction of text stops in the middle while working fine with PyMuPDF.
+* [212](https://github.com/pymupdf/RAG/issues/212) - In pymupdf4llm, if a page has multiple images, only 1 image per-page is extracted.
+* [213](https://github.com/pymupdf/RAG/issues/213) - Many ���� after converting when using pymupdf4llm
+* [215](https://github.com/pymupdf/RAG/issues/215) - Spending too much time on identifying text bboxes
+* [218](https://github.com/pymupdf/RAG/issues/218) - IndexError in get_raw_lines when processing PDFs with formulas
+* [225](https://github.com/pymupdf/RAG/issues/225) - Text with background missing from output.
+* [229](https://github.com/pymupdf/RAG/issues/229) - Duplicated Table Content on pymuPDF4LLM.
+
+
+### Other Changes:
+
+* Added **_new parameter_** `filename`: (str), optional. Overwrites or sets the filename for saved images. Useful when the document is opened from memory.
+
+* Added **_new parameter_** `use_glyphs`: (bool), optional. Request to use the glyph number (if possible) of a character if the font has no back-translation to the original Unicode value. The default is `False` which causes � symbols to be rendered in these cases.
+
+* Added **_strike-out support_**: We now detect and render ~~striked-out text.~~
+
+* Improved **_background color_** detection: We have introduced a simple background color detection mechanism: If a page shows an identical color in all four corners, we assume this to be the background color. Text and vector graphics with this color will be ignored as invisible.
+
+* Improved **_invisible text detection_**: Text with an alpha value of 0 is now ignored.
+
+* Improved **_fake-bold_** detection: Text mimicking bold appearance is now treated like standard bold text in most cases.
+
+* Header handling changes:
+    - Detection now happens based on the **_largest font size_** of the line.
+    - Uniformly rendered: All spans of a header line will now be rendered with the same appearance.
+
+* Changed handling of parameter `graphics_limit`: We previously ignored a page completely if the vector graphics count exceeded the limit. We now only ignore vector graphics if their count **_outside table boundary boxes_** is too large. This should only suppress vector graphics on the page, while keeping images, text and table content extractable.
+
+* Changed the `margins` default to 0. The previous default `(0, 50, 0, 50)` ignored 50 points at the top and bottom of pages. This has turned out to cause confusion in too many cases.
+
+
 ## Changes in version 0.0.17

 ### Fixes:
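
The background color heuristic described in the change log entry above can be sketched in a few lines. This is only an illustration, assuming a page that has already been rendered to a plain grid of RGB tuples; `detect_background` and `is_invisible` are hypothetical names and not functions from pymupdf4llm.

```python
def detect_background(pixels):
    """Return the color shared by all four corners of a pixel grid, or None.

    `pixels` is a hypothetical list of rows, each row a list of (r, g, b)
    tuples. This is an illustration, not the pymupdf4llm implementation.
    """
    if not pixels or not pixels[0]:
        return None
    corners = {
        pixels[0][0],    # top-left
        pixels[0][-1],   # top-right
        pixels[-1][0],   # bottom-left
        pixels[-1][-1],  # bottom-right
    }
    # Only an identical color in all four corners counts as background.
    return corners.pop() if len(corners) == 1 else None


def is_invisible(color, background):
    """Text or graphics drawn in the background color would be ignored."""
    return background is not None and color == background


white, black, red = (255, 255, 255), (0, 0, 0), (255, 0, 0)
page = [
    [white, white, white],
    [white, black, white],
    [white, white, white],
]
bg = detect_background(page)
```

With this grid, `bg` is the white tuple, so white text would be treated as invisible while black text would be kept. A page with differing corner colors yields `None`, in which case nothing is suppressed.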

pdf4llm/setup.py

Lines changed: 7 additions & 1 deletion
@@ -13,7 +13,7 @@
     "Programming Language :: Python :: 3",
     "Topic :: Utilities",
 ]
-requires = ["pymupdf4llm>=0.0.18"]
+requires = ["pymupdf4llm>=0.0.19"]

 setuptools.setup(
     name="pdf4llm",
@@ -32,4 +32,10 @@
     package_data={
         "pdf4llm": ["LICENSE"],
     },
+    project_urls={
+        "Documentation": "https://pymupdf.readthedocs.io/",
+        "Source": "https://github.com/pymupdf/RAG/tree/main/pdf4llm/pdf4llm",
+        "Tracker": "https://github.com/pymupdf/RAG/issues",
+        "Changelog": "https://github.com/pymupdf/RAG/blob/main/CHANGES.md",
+    },
 )

pymupdf4llm/pymupdf4llm/helpers/get_text_lines.py

Lines changed: 3 additions & 2 deletions
@@ -117,8 +117,9 @@ def sanitize_spans(line):
 if s["flags"] & 1 == 1:  # if a superscript, modify bbox
     # with that of the preceding or following span
     i = 1 if sno == 0 else sno - 1
-    neighbor = line["spans"][i]
-    sbbox.y1 = neighbor["bbox"][3]
+    if len(line["spans"]) > i:
+        neighbor = line["spans"][i]
+        sbbox.y1 = neighbor["bbox"][3]
     s["text"] = f"[{s['text']}]"
 s["bbox"] = sbbox  # update with the Rect version
 # include line/block numbers to facilitate separator insertion
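
The added length check matters when a line consists of a single superscript span: `i` then evaluates to 1, which is out of range, and this may well be the `IndexError` reported in issue 218. The sketch below reproduces the guard with plain dictionaries and list-based bounding boxes in place of PyMuPDF objects; `adjust_superscript_bbox` is a hypothetical helper written for this illustration.

```python
def adjust_superscript_bbox(spans, sno):
    """Return the bbox of spans[sno], borrowing y1 from a neighbor span.

    Hypothetical helper: spans are plain dicts with a 4-item "bbox" list,
    standing in for the span dictionaries used in the diff above.
    """
    bbox = list(spans[sno]["bbox"])
    if spans[sno]["flags"] & 1:  # bit 0 marks a superscript
        # use the following span for the first one, else the preceding span
        i = 1 if sno == 0 else sno - 1
        if len(spans) > i:  # a lone span has no neighbor at index 1
            bbox[3] = spans[i]["bbox"][3]
    return bbox


lone = [{"flags": 1, "bbox": [10, 5, 20, 12]}]
pair = [
    {"flags": 1, "bbox": [10, 5, 20, 12]},
    {"flags": 0, "bbox": [20, 8, 60, 20]},
]
```

For `lone`, the bounding box is returned unchanged instead of raising; for `pair`, the superscript takes over the bottom coordinate of the span that follows it.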

pymupdf4llm/pymupdf4llm/helpers/pymupdf_rag.py

Lines changed: 40 additions & 17 deletions
@@ -453,14 +453,21 @@ def write_text(
 # Pick up tables ABOVE this text block
 # ------------------------------------------------------------
 if tables:
-    for i, _ in sorted(
+    tab_candidates = sorted(
         [
-            j
-            for j in parms.tab_rects.items()
-            if j[1].y1 <= lrect.y0 and not (j[1] & lrect).is_empty
+            (i, tab_rect)
+            for i, tab_rect in parms.tab_rects.items()
+            if tab_rect.y1 <= lrect.y0
+            and i not in parms.deleted_tables
+            and (
+                0
+                or lrect.x0 <= tab_rect.x0 < lrect.x1
+                or lrect.x0 < tab_rect.x1 <= lrect.x1
+            )
         ],
         key=lambda j: (j[1].y1, j[1].x0),
-    ):
+    )
+    for i, _ in tab_candidates:
         out_string += "\n" + parms.tabs[i].to_markdown(clean=False) + "\n"
         if EXTRACT_WORDS:
             # for "words" extraction, add table cells as line rects
@@ -476,7 +483,7 @@ def write_text(
                 key=lambda c: (c.y1, c.x0),
             )
             parms.line_rects.extend(cells)
-        del tab_rects[i]
+        parms.deleted_tables.append(i)

 # ------------------------------------------------------------
 # Pick up images / graphics ABOVE this text block
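
The two hunks above change how tables are picked up before a text line: the rectangle intersection test is replaced by a horizontal overlap test, and tables that have been written are recorded in `deleted_tables` rather than removed from `tab_rects`. A minimal sketch of that selection, assuming a `namedtuple` as a stand-in for `pymupdf.Rect` and a hypothetical function name `tables_above`:

```python
from collections import namedtuple

# Stand-in for pymupdf.Rect, holding just the four coordinates.
Rect = namedtuple("Rect", "x0 y0 x1 y1")


def tables_above(tab_rects, lrect, deleted_tables):
    """Return indices of not-yet-emitted tables that end above `lrect`
    and whose left or right edge falls within its horizontal extent."""
    candidates = [
        (i, r)
        for i, r in tab_rects.items()
        if r.y1 <= lrect.y0
        and i not in deleted_tables
        and (lrect.x0 <= r.x0 < lrect.x1 or lrect.x0 < r.x1 <= lrect.x1)
    ]
    # same ordering as the diff: top-to-bottom, then left-to-right
    candidates.sort(key=lambda j: (j[1].y1, j[1].x0))
    return [i for i, _ in candidates]


tab_rects = {
    0: Rect(50, 100, 300, 200),   # above the text, same column
    1: Rect(400, 100, 550, 200),  # above the text, different column
    2: Rect(50, 400, 300, 500),   # below the text
}
lrect = Rect(50, 250, 300, 265)
deleted_tables = []

first = tables_above(tab_rects, lrect, deleted_tables)
deleted_tables.extend(first)  # mark as emitted instead of deleting
second = tables_above(tab_rects, lrect, deleted_tables)
```

Only table 0 qualifies on the first call, and nothing qualifies on the second because the table is already marked as emitted. Keeping the dictionary intact while tracking emitted indices separately lets later passes, such as the checks added to `output_tables`, skip the same tables.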
@@ -516,14 +523,18 @@ def write_text(

 # full line strikeout?
 all_strikeout = all([s["char_flags"] & 1 for s in spans])
+# full line italic?
+all_italic = all([s["flags"] & 2 for s in spans])
+# full line bold?
+all_bold = all([s["flags"] & 16 or s["char_flags"] & 8 for s in spans])

 # full line mono-spaced?
 if not IGNORE_CODE:
     all_mono = all([s["flags"] & 8 for s in spans])
 else:
     all_mono = False

-if all_mono:
+if all_mono and not hdr_string:
     if not code:  # if not already in code output mode:
         out_string += "```\n"  # switch on "code" mode
         code = True
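
The line-level checks in this hunk read style bits from every span of a line. The sketch below collects them in one place, using the bit values exactly as they appear in the diff; the interpretation of `char_flags & 8` as the fake-bold marker is an assumption based on the change log, and `line_styles` is a hypothetical helper operating on plain dictionaries.

```python
def line_styles(spans):
    """Report which styles apply to every span of a line.

    The bit values mirror the checks in the diff above; the spans are
    plain dicts used for illustration only.
    """
    return {
        "strikeout": all(s["char_flags"] & 1 for s in spans),
        "italic": all(s["flags"] & 2 for s in spans),
        "mono": all(s["flags"] & 8 for s in spans),
        # regular bold flag, or the char_flags bit the diff also accepts
        "bold": all(s["flags"] & 16 or s["char_flags"] & 8 for s in spans),
    }


spans = [
    {"flags": 16, "char_flags": 0},      # bold via font flag
    {"flags": 0, "char_flags": 8},       # bold via char_flags bit
    {"flags": 16 | 2, "char_flags": 0},  # bold and italic
]
styles = line_styles(spans)
```

Here only `bold` holds for the whole line, because a single non-italic span is enough to make `all_italic` false.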
@@ -536,9 +547,22 @@ def write_text(
     continue  # done with this line

 if hdr_string:  # if a header line skip the rest
+    if all_mono:
+        text = "`" + text + "`"
     if all_strikeout:
         text = "~~" + text + "~~"
-    out_string += hdr_string + text + "\n"
+    if all_italic:
+        text = "*" + text + "*"
+    if all_bold:
+        text = "**" + text + "**"
+    if hdr_string != prev_hdr_string:
+        out_string += hdr_string + text + "\n"
+    else:
+        # intercept if header text has been broken in multiple lines
+        while out_string.endswith("\n"):
+            out_string = out_string[:-1]
+        out_string += " " + text + "\n"
+    prev_hdr_string = hdr_string
     continue

 span0 = spans[0]
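
The header branch now decorates the whole line uniformly and joins consecutive lines of the same header level into one header. A small sketch of that behavior follows, with the line-level flags passed in as booleans; `render_header_line` is a hypothetical function, and `rstrip("\n")` is used as a shorthand for the `while` loop in the diff.

```python
def render_header_line(out_string, hdr_string, prev_hdr_string, text,
                       mono=False, strikeout=False, italic=False, bold=False):
    """Append one header line to `out_string` and return the new string.

    Hypothetical helper that follows the order of the checks in the diff:
    code, strikeout, italic, then bold markers wrap the text.
    """
    if mono:
        text = "`" + text + "`"
    if strikeout:
        text = "~~" + text + "~~"
    if italic:
        text = "*" + text + "*"
    if bold:
        text = "**" + text + "**"
    if hdr_string != prev_hdr_string:
        # a new header starts on its own line
        return out_string + hdr_string + text + "\n"
    # same header level as the previous line: treat it as a continuation
    return out_string.rstrip("\n") + " " + text + "\n"


out = render_header_line("", "## ", "", "Getting", bold=True)
out = render_header_line(out, "## ", "## ", "Started", bold=True)
```

The second call does not emit another `## ` prefix; it appends the text to the header already in the output, which is how a title broken across two lines ends up as a single markdown header.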
@@ -557,15 +581,6 @@ def write_text(
     out_string += "\n"
 prev_lrect = lrect

-# intercept if header text has been broken in multiple lines
-if hdr_string and hdr_string == prev_hdr_string:
-    while out_string.endswith("\n"):
-        out_string = out_string[:-1]
-    out_string = out_string[:-1] + " " + text + "\n"
-    continue
-
-prev_hdr_string = hdr_string
-
 # this line is not all-mono, so switch off "code" mode
 if code:  # in code output mode?
     out_string += "```\n"  # switch of code mode
@@ -594,6 +609,9 @@ def write_text(
 if strikeout:
     prefix = "~~" + prefix
     suffix += "~~"
+if mono:
+    prefix = "`" + prefix
+    suffix += "`"

 # convert intersecting link to markdown syntax
 ltext = resolve_links(parms.links, s)
@@ -649,6 +667,8 @@ def output_tables(parms, text_rect):
     [j for j in parms.tab_rects.items() if j[1].y1 <= text_rect.y0],
     key=lambda j: (j[1].y1, j[1].x0),
 ):
+    if i in parms.deleted_tables:
+        continue
     this_md += parms.tabs[i].to_markdown(clean=False)
     if EXTRACT_WORDS:
         # for "words" extraction, add table cells as line rects
@@ -671,6 +691,8 @@ def output_tables(parms, text_rect):
     parms.tab_rects.items(),
     key=lambda j: (j[1].y1, j[1].x0),
 ):
+    if i in parms.deleted_tables:
+        continue
     this_md += parms.tabs[i].to_markdown(clean=False)
     if EXTRACT_WORDS:
         # for "words" extraction, add table cells as line rects
@@ -926,6 +948,7 @@ def get_page_output(doc, pno, margins, textflags, FILENAME):
 parms.img_rects.extend(vg_clusters0)
 parms.img_rects = sorted(set(parms.img_rects), key=lambda r: (r.y1, r.x0))
 parms.deleted_images = []
+parms.deleted_tables = []
 # these may no longer be pairwise disjoint:
 # remove area overlaps by joining into larger rects
 parms.vg_clusters0 = refine_boxes(vg_clusters0)

pymupdf4llm/setup.py

Lines changed: 6 additions & 0 deletions
@@ -32,4 +32,10 @@
     package_data={
         "pymupdf4llm": ["LICENSE", "helpers/*.py", "llama/*.py"],
     },
+    project_urls={
+        "Documentation": "https://pymupdf.readthedocs.io/",
+        "Source": "https://github.com/pymupdf/RAG/tree/main/pymupdf4llm/pymupdf4llm",
+        "Tracker": "https://github.com/pymupdf/RAG/issues",
+        "Changelog": "https://github.com/pymupdf/RAG/blob/main/CHANGES.md",
+    },
 )
