Addresses multiple issues

JorjMcKie · JorjMcKie · commit 2ec62aeb07fa · 2025-03-29T09:14:47.000-04:00
For details see CHANGES.md in the repository.
diff --git a/CHANGES.md b/CHANGES.md
@@ -1,6 +1,50 @@
 # Change Log
 
 
+## Changes in version 0.0.19
+
+### Fixes:
+The following list also includes fixes that were already made in version 0.0.18.
+
+* [158](https://github.com/pymupdf/RAG/issues/158) - Very long titles when converting to markdown.
+* [155](https://github.com/pymupdf/RAG/issues/155) - Inconsistent image extraction from image-only PDFs
+* [161](https://github.com/pymupdf/RAG/issues/161) - force_text param ignored.
+* [162](https://github.com/pymupdf/RAG/issues/162) - to_markdown isn't outputting all the pages but get_text is.
+* [173](https://github.com/pymupdf/RAG/issues/173) - First column of table is repeated before the actual table.
+* [187](https://github.com/pymupdf/RAG/issues/187) - Unsolicited Text Particles
+* [188](https://github.com/pymupdf/RAG/issues/188) - Takes lot of time to convert into markdown.
+* [191](https://github.com/pymupdf/RAG/issues/191) - Extraction of text stops in the middle while working fine with PyMuPDF.
+* [212](https://github.com/pymupdf/RAG/issues/212) - In pymupdf4llm, if a page has multiple images, only 1 image per-page is extracted.
+* [213](https://github.com/pymupdf/RAG/issues/213) - Many ���� after converting when using pymupdf4llm
+* [215](https://github.com/pymupdf/RAG/issues/215) - Spending too much time on identifying text bboxes
+* [218](https://github.com/pymupdf/RAG/issues/218) - IndexError in get_raw_lines when processing PDFs with formulas
+* [225](https://github.com/pymupdf/RAG/issues/225) - Text with background missing from output.
+* [229](https://github.com/pymupdf/RAG/issues/229) - Duplicated Table Content on pymuPDF4LLM.
+
+
+### Other Changes:
+
+* Added **_new parameter_** `filename`: (str), optional. Overwrites or sets the filename for saved images. Useful when the document is opened from memory.
+
+* Added **_new parameter_** `use_glyphs`: (bool), optional. Requests use of a character's glyph number (if possible) when the font has no back-translation to the original Unicode value. The default is `False`, which causes &#xfffd; symbols to be rendered in these cases.
+
+* Added **_strike-out support_**: We now detect and render ~~struck-out text~~.
+
+* Improved **_background color_** detection: We have introduced a simple background color detection mechanism: If a page shows an identical color in all four corners, we assume this to be the background color. Text and vector graphics with this color will be ignored as invisible.
+
+* Improved **_invisible text detection_**: Text with an alpha value of 0 is now ignored.
+
+* Improved **_fake-bold_** detection: Text mimicking bold appearance is now treated like standard bold text in most cases.
+
+* Header handling changes:
+    - Detection now happens based on the **_largest font size_** of the line.
+    - Uniformly rendered: All spans of a header line will now be rendered with the same appearance.
+
+* Changed handling of parameter `graphics_limit`: We previously ignored a page completely if the vector graphics count exceeded the limit. We now only ignore vector graphics if their count **_outside table boundary boxes_** is too large. This should only suppress vector graphics on the page, while keeping images, text and table content extractable.
+
+* Changed the `margins` default to 0. The previous default `(0, 50, 0, 50)` ignored 50 points at the top and bottom of pages. This has turned out to cause confusion in too many cases.
+
+
 ## Changes in version 0.0.17
 
 ### Fixes:
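
The background color rule described in the 0.0.19 notes above (identical color in all four page corners) can be sketched as follows. This is a minimal illustration, not the package's code: the function name `detect_background` and the representation of a page as a grid of RGB tuples are assumptions made for the example.

```python
from typing import Optional, Sequence, Tuple

Color = Tuple[int, int, int]  # RGB triple (assumed representation)


def detect_background(pixels: Sequence[Sequence[Color]]) -> Optional[Color]:
    """Return the color shared by all four corners of a pixel grid.

    Returns None if the grid is empty or the corners disagree,
    in which case no background color is assumed.
    """
    if not pixels or not pixels[0]:
        return None
    corners = {
        pixels[0][0],    # top-left
        pixels[0][-1],   # top-right
        pixels[-1][0],   # bottom-left
        pixels[-1][-1],  # bottom-right
    }
    # exactly one distinct value means all four corners are identical
    return corners.pop() if len(corners) == 1 else None
```

Under this rule, text or vector graphics drawn in the returned color would be treated as invisible, as the changelog entry states.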
diff --git a/pdf4llm/setup.py b/pdf4llm/setup.py
@@ -13,7 +13,7 @@
     "Programming Language :: Python :: 3",
     "Topic :: Utilities",
 ]
-requires = ["pymupdf4llm>=0.0.18"]
+requires = ["pymupdf4llm>=0.0.19"]
 
 setuptools.setup(
     name="pdf4llm",
@@ -32,4 +32,10 @@
     package_data={
         "pdf4llm": ["LICENSE"],
     },
+    project_urls={
+        "Documentation": "https://pymupdf.readthedocs.io/",
+        "Source": "https://github.com/pymupdf/RAG/tree/main/pdf4llm/pdf4llm",
+        "Tracker": "https://github.com/pymupdf/RAG/issues",
+        "Changelog": "https://github.com/pymupdf/RAG/blob/main/CHANGES.md",
+    },
 )
diff --git a/pymupdf4llm/pymupdf4llm/helpers/get_text_lines.py b/pymupdf4llm/pymupdf4llm/helpers/get_text_lines.py
@@ -117,8 +117,9 @@ def sanitize_spans(line):
                 if s["flags"] & 1 == 1:  # if a superscript, modify bbox
                     # with that of the preceding or following span
                     i = 1 if sno == 0 else sno - 1
-                    neighbor = line["spans"][i]
-                    sbbox.y1 = neighbor["bbox"][3]
+                    if len(line["spans"]) > i:
+                        neighbor = line["spans"][i]
+                        sbbox.y1 = neighbor["bbox"][3]
                     s["text"] = f"[{s['text']}]"
                 s["bbox"] = sbbox  # update with the Rect version
                 # include line/block numbers to facilitate separator insertion
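
The guard added to `sanitize_spans` above matters when a line consists of a single superscript span: for `sno == 0` the neighbor index is 1, which does not exist. A minimal sketch of the guarded lookup, using plain dictionaries in place of real span objects (the helper name `superscript_bottom` is hypothetical):

```python
def superscript_bottom(spans, sno):
    """Return the bottom coordinate (y1) to use for superscript span `sno`.

    The value is borrowed from the following span (if `sno` is 0) or the
    preceding span (otherwise). If no such neighbor exists, the span keeps
    its own bottom coordinate instead of raising an IndexError.
    """
    i = 1 if sno == 0 else sno - 1
    y1 = spans[sno]["bbox"][3]
    if len(spans) > i:  # same guard as in the patch
        y1 = spans[i]["bbox"][3]
    return y1
```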
diff --git a/pymupdf4llm/pymupdf4llm/helpers/pymupdf_rag.py b/pymupdf4llm/pymupdf4llm/helpers/pymupdf_rag.py
@@ -453,14 +453,21 @@ def write_text(
             # Pick up tables ABOVE this text block
             # ------------------------------------------------------------
             if tables:
-                for i, _ in sorted(
+                tab_candidates = sorted(
                     [
-                        j
-                        for j in parms.tab_rects.items()
-                        if j[1].y1 <= lrect.y0 and not (j[1] & lrect).is_empty
+                        (i, tab_rect)
+                        for i, tab_rect in parms.tab_rects.items()
+                        if tab_rect.y1 <= lrect.y0
+                        and i not in parms.deleted_tables
+                        and (
+                            0
+                            or lrect.x0 <= tab_rect.x0 < lrect.x1
+                            or lrect.x0 < tab_rect.x1 <= lrect.x1
+                        )
                     ],
                     key=lambda j: (j[1].y1, j[1].x0),
-                ):
+                )
+                for i, _ in tab_candidates:
                     out_string += "\n" + parms.tabs[i].to_markdown(clean=False) + "\n"
                     if EXTRACT_WORDS:
                         # for "words" extraction, add table cells as line rects
@@ -476,7 +483,7 @@ def write_text(
                             key=lambda c: (c.y1, c.x0),
                         )
                         parms.line_rects.extend(cells)
-                    del tab_rects[i]
+                    parms.deleted_tables.append(i)
 
             # ------------------------------------------------------------
             # Pick up images / graphics ABOVE this text block
@@ -516,14 +523,18 @@ def write_text(
 
             # full line strikeout?
             all_strikeout = all([s["char_flags"] & 1 for s in spans])
+            # full line italic?
+            all_italic = all([s["flags"] & 2 for s in spans])
+            # full line bold?
+            all_bold = all([s["flags"] & 16 or s["char_flags"] & 8 for s in spans])
 
             # full line mono-spaced?
             if not IGNORE_CODE:
                 all_mono = all([s["flags"] & 8 for s in spans])
             else:
                 all_mono = False
 
-            if all_mono:
+            if all_mono and not hdr_string:
                 if not code:  # if not already in code output mode:
                     out_string += "```\n"  # switch on "code" mode
                     code = True
@@ -536,9 +547,22 @@ def write_text(
                 continue  # done with this line
 
             if hdr_string:  # if a header line skip the rest
+                if all_mono:
+                    text = "`" + text + "`"
                 if all_strikeout:
                     text = "~~" + text + "~~"
-                out_string += hdr_string + text + "\n"
+                if all_italic:
+                    text = "*" + text + "*"
+                if all_bold:
+                    text = "**" + text + "**"
+                if hdr_string != prev_hdr_string:
+                    out_string += hdr_string + text + "\n"
+                else:
+                    # intercept if header text has been broken in multiple lines
+                    while out_string.endswith("\n"):
+                        out_string = out_string[:-1]
+                    out_string += " " + text + "\n"
+                prev_hdr_string = hdr_string
                 continue
 
             span0 = spans[0]
@@ -557,15 +581,6 @@ def write_text(
                 out_string += "\n"
             prev_lrect = lrect
 
-            # intercept if header text has been broken in multiple lines
-            if hdr_string and hdr_string == prev_hdr_string:
-                while out_string.endswith("\n"):
-                    out_string = out_string[:-1]
-                out_string = out_string[:-1] + " " + text + "\n"
-                continue
-
-            prev_hdr_string = hdr_string
-
             # this line is not all-mono, so switch off "code" mode
             if code:  # in code output mode?
                 out_string += "```\n"  # switch of code mode
@@ -594,6 +609,9 @@ def write_text(
                 if strikeout:
                     prefix = "~~" + prefix
                     suffix += "~~"
+                if mono:
+                    prefix = "`" + prefix
+                    suffix += "`"
 
                 # convert intersecting link to markdown syntax
                 ltext = resolve_links(parms.links, s)
@@ -649,6 +667,8 @@ def output_tables(parms, text_rect):
                 [j for j in parms.tab_rects.items() if j[1].y1 <= text_rect.y0],
                 key=lambda j: (j[1].y1, j[1].x0),
             ):
+                if i in parms.deleted_tables:
+                    continue
                 this_md += parms.tabs[i].to_markdown(clean=False)
                 if EXTRACT_WORDS:
                     # for "words" extraction, add table cells as line rects
@@ -671,6 +691,8 @@ def output_tables(parms, text_rect):
                 parms.tab_rects.items(),
                 key=lambda j: (j[1].y1, j[1].x0),
             ):
+                if i in parms.deleted_tables:
+                    continue
                 this_md += parms.tabs[i].to_markdown(clean=False)
                 if EXTRACT_WORDS:
                     # for "words" extraction, add table cells as line rects
@@ -926,6 +948,7 @@ def get_page_output(doc, pno, margins, textflags, FILENAME):
         parms.img_rects.extend(vg_clusters0)
         parms.img_rects = sorted(set(parms.img_rects), key=lambda r: (r.y1, r.x0))
         parms.deleted_images = []
+        parms.deleted_tables = []
         # these may no longer be pairwise disjoint:
         # remove area overlaps by joining into larger rects
         parms.vg_clusters0 = refine_boxes(vg_clusters0)
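
The reworked table selection in `write_text` above keeps a table only if it ends above the text line, has not been output already, and has its left or right edge within the line's horizontal extent. The sketch below reproduces that filter with a simple named tuple instead of a PyMuPDF rectangle; `Rect` and `tables_above` are stand-in names for this example only.

```python
from collections import namedtuple

# stand-in for a rectangle type with the same attribute names
Rect = namedtuple("Rect", "x0 y0 x1 y1")


def tables_above(tab_rects, lrect, deleted_tables):
    """Return indices of tables to output before the text line `lrect`.

    tab_rects: dict mapping table index -> Rect
    deleted_tables: indices of tables that were already output
    """
    candidates = [
        (i, r)
        for i, r in tab_rects.items()
        if r.y1 <= lrect.y0  # table ends above the text line
        and i not in deleted_tables  # not written out before
        and (
            lrect.x0 <= r.x0 < lrect.x1  # left edge within the line
            or lrect.x0 < r.x1 <= lrect.x1  # right edge within the line
        )
    ]
    # top-to-bottom, then left-to-right
    candidates.sort(key=lambda j: (j[1].y1, j[1].x0))
    return [i for i, _ in candidates]
```

Recording output tables in `deleted_tables` rather than deleting entries from the dictionary leaves `tab_rects` unchanged, and `output_tables` skips the recorded indices in the same way.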
diff --git a/pymupdf4llm/setup.py b/pymupdf4llm/setup.py
@@ -32,4 +32,10 @@
     package_data={
         "pymupdf4llm": ["LICENSE", "helpers/*.py", "llama/*.py"],
     },
+    project_urls={
+        "Documentation": "https://pymupdf.readthedocs.io/",
+        "Source": "https://github.com/pymupdf/RAG/tree/main/pymupdf4llm/pymupdf4llm",
+        "Tracker": "https://github.com/pymupdf/RAG/issues",
+        "Changelog": "https://github.com/pymupdf/RAG/blob/main/CHANGES.md",
+    },
 )
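
For reference, the header branch of `write_text` in `pymupdf_rag.py` above can be condensed into a small standalone function. This is a sketch under assumptions: the name `render_header_line` and its keyword arguments are hypothetical, and the boolean flags stand for the `all_mono`, `all_strikeout`, `all_italic` and `all_bold` values computed in the patch.

```python
def render_header_line(out_string, hdr_string, prev_hdr_string, text,
                       mono=False, strikeout=False, italic=False, bold=False):
    """Append one header line to `out_string` and return the result.

    Styling is applied to the whole line. If the header prefix equals the
    previous one, the text is treated as a continuation of a header that
    was broken across lines and is joined to it with a space.
    """
    if mono:
        text = "`" + text + "`"
    if strikeout:
        text = "~~" + text + "~~"
    if italic:
        text = "*" + text + "*"
    if bold:
        text = "**" + text + "**"
    if hdr_string != prev_hdr_string:
        return out_string + hdr_string + text + "\n"
    # continuation: strip trailing line breaks, then append the text
    return out_string.rstrip("\n") + " " + text + "\n"
```

A two-line title would then be rendered as a single header:

```python
out = render_header_line("", "## ", "", "A very long")
out = render_header_line(out, "## ", "## ", "document title")
```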