Commit acc4001

Merge pull request #108 from pymupdf/v0.0.11
Some fixes
2 parents a3efce3 + bff4983 commit acc4001

3 files changed: +118 -14 lines changed

docs/src/changes.rst

Lines changed: 103 additions & 0 deletions
@@ -0,0 +1,103 @@
+.. include:: header.rst
+
+
+Change Log
+===========================================================================
+
+Changes in version 0.0.11
+--------------------------
+
+Fixes:
+~~~~~~~
+
+* `90 <https://github.com/pymupdf/RAG/issues/90>`_ "'Quad' object has no attribute 'tl'"
+* `88 <https://github.com/pymupdf/RAG/issues/88>`_ "Bug in is_significant function"
+
+
+Improvements:
+~~~~~~~~~~~~~~
+* Extended the list of known bullet point characters.
+
+
+Changes in version 0.0.10
+--------------------------
+
+Fixes:
+~~~~~~~
+
+* `73 <https://github.com/pymupdf/RAG/issues/73>`_ "bug in to_markdown internal function"
+* `74 <https://github.com/pymupdf/RAG/issues/74>`_ "minimum area for images & vector graphics"
+* `75 <https://github.com/pymupdf/RAG/issues/75>`_ "Poor Markdown Generation for Particular PDF"
+* `76 <https://github.com/pymupdf/RAG/issues/76>`_ "suggestion on useful api parameters"
+
+
+Improvements:
+~~~~~~~~~~~~~~
+* Improved recognition of "insignificant" vector graphics. Graphics like text highlights or borders will be ignored.
+* The format of saved images can now be controlled via the new parameter `image_format`.
+* Images can be stored in a specific folder via the new parameter `image_path`.
+* Images are **not stored if contained** in another image on the same page.
+* Images are **not stored if too small:** if the width or height is less than 5% of the corresponding page dimension.
+* All text is written by default. If `write_images=True`, text on images / graphics can be suppressed by setting `force_text=False`.
+
+
+Changes in version 0.0.9
+--------------------------
+
+Fixes:
+~~~~~~~
+
+* `71 <https://github.com/pymupdf/RAG/issues/71>`_ "Unexpected results in pymupdf4llm but pymupdf works"
+* `68 <https://github.com/pymupdf/RAG/issues/68>`_ "Issue with text extraction near footer of page"
+
+
+Improvements:
+~~~~~~~~~~~~~~
+* Improved identification of scattered text span particles. This should address most issues with out-of-sequence situations.
+* We now correctly process rotated pages (see issue #68).
+
+
+Changes in version 0.0.8
+--------------------------
+
+Fixes:
+~~~~~~~
+
+* `65 <https://github.com/pymupdf/RAG/issues/65>`_ Fix typo in `pymupdf_rag.py`.
+
+
+Changes in version 0.0.7
+--------------------------
+
+Fixes:
+~~~~~~~
+
+* `54 <https://github.com/pymupdf/RAG/issues/54>`_ "Mistakes in orchestrating sentences". Additional fix: text extraction no longer uses the TEXT_DEHYPHENATE flag bit.
+
+Improvements:
+~~~~~~~~~~~~~~~~
+
+* Improved the algorithm dealing with vector graphics. Vector graphics are now more reliably classified as irrelevant: we now detect when "strokes" exist only near the border of the graphic's bounding box. This is quite often the case for code snippets.
+
+
+Changes in version 0.0.6
+--------------------------
+
+Fixes:
+~~~~~~~
+
+* `55 <https://github.com/pymupdf/RAG/issues/55>`_ "Bug in helpers/multi_column.py - IndexError: list index out of range"
+* `54 <https://github.com/pymupdf/RAG/issues/54>`_ "Mistakes in orchestrating sentences"
+* `52 <https://github.com/pymupdf/RAG/issues/52>`_ "Chunking of text files"
+* Partial fix for `41 <https://github.com/pymupdf/RAG/issues/41>`_ / `40 <https://github.com/pymupdf/RAG/issues/40>`_. Improved page column detection, but still no silver bullet for overly complex page layouts.
+
+Improvements:
+~~~~~~~~~~~~~~~~
+
+* New parameter `dpi` to specify the resolution of images.
+* New parameters `page_width` / `page_height` for easily processing reflowable documents (Text, Office, e-books).
+* New parameter `graphics_limit` to avoid spending run time on content of little value.
+* New parameter `table_strategy` to directly control the table detection strategy.
+
+.. include:: footer.rst
+
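
The two image-filtering rules listed under version 0.0.10 (skip images contained in another image on the same page, skip images whose width or height is below 5% of the page dimension) can be sketched as follows. This is only an illustration under stated assumptions, not the library's code: rectangles are plain `(x0, y0, x1, y1)` tuples, and `is_too_small`, `contains` and `images_to_store` are hypothetical helper names.

```python
from typing import List, Tuple

# Hypothetical plain-tuple rectangle: (x0, y0, x1, y1)
Rect = Tuple[float, float, float, float]


def is_too_small(img: Rect, page: Rect, min_fraction: float = 0.05) -> bool:
    """True if the image width or height is below min_fraction of the page's."""
    img_w, img_h = img[2] - img[0], img[3] - img[1]
    page_w, page_h = page[2] - page[0], page[3] - page[1]
    return img_w < min_fraction * page_w or img_h < min_fraction * page_h


def contains(outer: Rect, inner: Rect) -> bool:
    """True if 'inner' lies completely inside 'outer' (borders included)."""
    return (
        outer[0] <= inner[0]
        and outer[1] <= inner[1]
        and outer[2] >= inner[2]
        and outer[3] >= inner[3]
    )


def images_to_store(images: List[Rect], page: Rect) -> List[Rect]:
    """Apply both rules and return the image rectangles that would be kept."""
    kept = []
    for i, img in enumerate(images):
        if is_too_small(img, page):
            continue  # rule: too small relative to the page
        covered = any(
            contains(other, img) and (other != img or j < i)
            for j, other in enumerate(images)
            if j != i
        )
        if covered:
            continue  # rule: contained in another image on the same page
        kept.append(img)
    return kept


page = (0, 0, 600, 800)
images = [
    (50, 50, 550, 400),    # large image: kept
    (100, 100, 200, 200),  # inside the first image: skipped
    (10, 10, 30, 300),     # width is under 5% of the page width: skipped
    (300, 500, 500, 700),  # independent image: kept
]
print(images_to_store(images, page))
```

The tie-break on index is a design choice of this sketch: when two rectangles are identical, only the first one is kept instead of both being dropped.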

pymupdf4llm/pymupdf4llm/helpers/get_text_lines.py

Lines changed: 11 additions & 10 deletions
@@ -69,7 +69,9 @@ def sanitize_spans(line):
         Returns:
             A list of sorted, and potentially cleaned-up spans
         """
-        line.sort(key=lambda s: s["bbox"].x0)  # sort left to right
+        # sort ascending horizontally
+        line.sort(key=lambda s: s["bbox"].x0)
+        # join spans, delete duplicates
         for i in range(len(line) - 1, 0, -1):  # iterate back to front
             s0 = line[i - 1]
             s1 = line[i]
@@ -78,13 +80,17 @@ def sanitize_spans(line):
             delta = s1["size"] * 0.1
             if s0["bbox"].x1 + delta < s1["bbox"].x0:
                 continue  # all good: no joining needed
+
+            # We need to join bbox and text of two consecutive spans
+            # On occasion, spans may also be duplicated.
+            if s0["text"] != s1["text"] or s0["bbox"] != s1["bbox"]:
+                s0["text"] += s1["text"]
             s0["bbox"] |= s1["bbox"]  # join boundary boxes
-            s0["text"] += s1["text"]  # join the text
             del line[i]  # delete the joined-in span
             line[i - 1] = s0  # update the span
         return line
 
-    if clip is None:  # use TextPage if not provided
+    if clip is None:  # use TextPage rect if not provided
         clip = textpage.rect
     # extract text blocks - if bbox is not empty
     blocks = [
@@ -126,10 +132,7 @@ def sanitize_spans(line):
         sbbox = s["bbox"]  # this bbox
         sbbox0 = line[-1]["bbox"]  # previous bbox
         # if any of top or bottom coordinates are close enough, join...
-        if (
-            abs(sbbox.y1 - sbbox0.y1) <= y_delta
-            or abs(sbbox.y0 - sbbox0.y0) <= y_delta
-        ):
+        if abs(sbbox.y1 - sbbox0.y1) <= y_delta or abs(sbbox.y0 - sbbox0.y0) <= y_delta:
             line.append(s)  # append to this line
             lrect |= sbbox  # extend line rectangle
             continue
@@ -150,9 +153,7 @@ def sanitize_spans(line):
     return nlines
 
 
-def get_text_lines(
-    page, *, textpage=None, clip=None, sep="\t", tolerance=3, ocr=False
-):
+def get_text_lines(page, *, textpage=None, clip=None, sep="\t", tolerance=3, ocr=False):
     """Extract text by line keeping natural reading sequence.
 
     Notes:
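
The span-joining change in `sanitize_spans` above can be tried without PyMuPDF using the following sketch. It assumes spans are plain dicts whose `bbox` is an `(x0, y0, x1, y1)` tuple instead of a rectangle object; `union` and `join_spans` are hypothetical names introduced for illustration. The logic mirrors the diff: adjacent spans are merged when the gap is at most 10% of the font size, and the text is appended only if the second span is not an exact duplicate of the first.

```python
def union(a, b):
    """Smallest rectangle covering both (x0, y0, x1, y1) tuples."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def join_spans(line):
    """Sort spans left to right, join close neighbours, drop duplicates.

    Note: like the original, this modifies the span dicts in place.
    """
    line.sort(key=lambda s: s["bbox"][0])  # sort ascending horizontally
    for i in range(len(line) - 1, 0, -1):  # iterate back to front
        s0 = line[i - 1]
        s1 = line[i]
        delta = s1["size"] * 0.1  # tolerated gap: 10% of the font size
        if s0["bbox"][2] + delta < s1["bbox"][0]:
            continue  # gap is large enough: keep spans separate
        # append the text unless s1 is an exact duplicate of s0
        if s0["text"] != s1["text"] or s0["bbox"] != s1["bbox"]:
            s0["text"] += s1["text"]
        s0["bbox"] = union(s0["bbox"], s1["bbox"])  # join bounding boxes
        del line[i]  # delete the joined-in span
    return line


spans = [
    {"text": "World", "bbox": (80, 0, 130, 10), "size": 10},
    {"text": "lo", "bbox": (30.5, 0, 50, 10), "size": 10},
    {"text": "Hel", "bbox": (0, 0, 30, 10), "size": 10},
    {"text": "lo", "bbox": (30.5, 0, 50, 10), "size": 10},  # duplicated span
]
result = join_spans(spans)
print([s["text"] for s in result])
```

Before the fix, a duplicated span would have had its text appended a second time; with the added comparison, the duplicate is removed without changing the text.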

pymupdf4llm/pymupdf4llm/helpers/pymupdf_rag.py

Lines changed: 4 additions & 4 deletions
@@ -40,15 +40,15 @@
 if fitz.pymupdf_version_tuple < (1, 24, 2):
     raise NotImplementedError("PyMuPDF version 1.24.2 or later is needed.")
 
-bullet = (
+bullet = [
     "- ",
     "* ",
     chr(0xF0A7),
     chr(0xF0B7),
     chr(0xB7),
     chr(8226),
-    chr(9679),
-)
+] + list(map(chr, range(9642, 9680)))
+
 GRAPHICS_TEXT = "\n![](%s)\n"
 
 
@@ -193,7 +193,7 @@ def is_significant(box, paths):
         for itm in p["items"]:
             if itm[0] in ("l", "c"):  # line or curve
                 points.extend(itm[1:])  # append all the points
-            elif itm[0] == "q":  # quad
+            elif itm[0] == "qu":  # quad
                 q = itm[1]
                 # follow corners anti-clockwise
                 points.extend([q.ul, q.ll, q.lr, q.ur, q.ul])
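
To see what the extended `bullet` list covers, the expression from the diff can be evaluated on its own. `range(9642, 9680)` spans the code points U+25AA through U+25CF, so the previously listed `chr(9679)` is still included. How the library matches text against this list is not shown in the diff; `starts_with_bullet` below is a hypothetical helper, and it converts the list to a tuple because `str.startswith` accepts a string or a tuple of strings, not a list.

```python
# The bullet list as defined in the new version of the diff
bullet = [
    "- ",
    "* ",
    chr(0xF0A7),
    chr(0xF0B7),
    chr(0xB7),
    chr(8226),
] + list(map(chr, range(9642, 9680)))


def starts_with_bullet(text):
    """Hypothetical helper: True if text begins with a known bullet marker."""
    # str.startswith needs a tuple of prefixes; passing a list raises TypeError
    return text.startswith(tuple(bullet))


print(len(bullet))                     # number of known bullet markers
print(hex(9642), hex(9679))            # first and last code point of the range
print(starts_with_bullet("\u25aa item"))  # small black square as bullet
print(starts_with_bullet("plain text"))
```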
