Commit b86b33f

Merge pull request #139 from pymupdf/v0.0.15
Version 0.0.15
2 parents 65130d2 + 8578c49 commit b86b33f

8 files changed, +318 -162 lines changed

docs/src/changes.rst

Lines changed: 21 additions & 1 deletion
@@ -4,6 +4,27 @@
 Change Log
 ===========================================================================
 
+Changes in version 0.0.15
+--------------------------
+
+Fixes:
+~~~~~~~
+
+* `138 <https://github.com/pymupdf/RAG/issues/138>`_ "Table is not extracted and some text order was wrong."
+* `135 <https://github.com/pymupdf/RAG/issues/135>`_ "Problem with multiple columns in simple text."
+* `134 <https://github.com/pymupdf/RAG/issues/134>`_ "Exclude images based on size threshold parameter."
+* `132 <https://github.com/pymupdf/RAG/issues/132>`_ "Optionally embed images as base64 string."
+* `128 <https://github.com/pymupdf/RAG/issues/128>`_ "Enhanced image embedding format."
+
+
+Improvements:
+~~~~~~~~~~~~~~
+* New parameter `embed_images` (bool) **embeds** images and vector graphics in the markdown text as base64-encoded strings. Ignores `write_images` and `image_path` parameters.
+* New parameter `image_size_limit` which is a float between 0 and 1, default is 0.05 (5%). Causes images to be ignored if their width or height values are smaller than the corresponding fraction of the page's width or height.
+* The algorithm has been improved which determines the sequence of the text rectangles on multi-column pages.
+* Change of the header identification algorithm: If more than six header levels are required for a document, then all text with a font size larger than body text is assumed to be a header of level 6 (i.e. HTML "h6" = "###### ").
+
+
 Changes in version 0.0.13
 --------------------------
 
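The two image-related changelog entries above can be illustrated with a small standalone sketch. It does not call pymupdf4llm. The function names `image_is_ignored` and `embed_as_markdown` are hypothetical, and the data-URI Markdown form is an assumption about what "embedded as base64" could look like, not the library's confirmed output.

```python
import base64


def image_is_ignored(img_width, img_height, page_width, page_height,
                     image_size_limit=0.05):
    """Return True if an image falls below the size threshold.

    Follows the changelog wording: the image is ignored when its width OR
    its height is smaller than the given fraction of the page dimension.
    (Hypothetical helper, not part of pymupdf4llm.)
    """
    if not 0 <= image_size_limit <= 1:
        raise ValueError("image_size_limit must be between 0 and 1")
    return (
        img_width < page_width * image_size_limit
        or img_height < page_height * image_size_limit
    )


def embed_as_markdown(image_bytes, mime_type="image/png"):
    """Build a Markdown image reference carrying the image as a data URI.

    The exact string format is an assumption made for illustration.
    """
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"![](data:{mime_type};base64,{encoded})"
```

On a 612 x 792 point page with the default limit, the thresholds are 30.6 points wide and 39.6 points high, so a 100 x 30 image would be skipped because of its height alone.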

@@ -19,7 +40,6 @@ Improvements:
 * New parameter `extract_words` enforces `page_chunks=True` and adds a "words" list to each page dictionary.
 
 
-
 Changes in version 0.0.11
 --------------------------
 
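The header identification change in the 0.0.15 changelog can be read in more than one way. The sketch below shows one plausible reading: the largest font sizes get levels 1 to 6, and any further size above body text is also mapped to level 6. The function `header_prefixes` and its arguments are hypothetical and are not taken from the `IdentifyHeaders` class.

```python
def header_prefixes(font_sizes, body_size, max_levels=6):
    """Map each font size above body text to a Markdown header prefix.

    Sizes are ranked from largest to smallest. Ranks beyond `max_levels`
    are clamped to the deepest level, i.e. "###### " for the default of 6.
    (Illustrative only; the library's actual logic may differ.)
    """
    larger = sorted({s for s in font_sizes if s > body_size}, reverse=True)
    prefixes = {}
    for rank, size in enumerate(larger, start=1):
        level = min(rank, max_levels)  # clamp to the deepest allowed level
        prefixes[size] = "#" * level + " "
    return prefixes
```

With eight distinct sizes above body text, the three smallest of them all share the level 6 prefix under this reading.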

pymupdf4llm/pymupdf4llm/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 from .helpers.pymupdf_rag import IdentifyHeaders, to_markdown
 
-__version__ = "0.0.14"
+__version__ = "0.0.15"
 version = __version__
 version_tuple = tuple(map(int, version.split(".")))
 
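As a brief illustration of why a `version_tuple` is useful next to the version string, the sketch below applies the same expression to two version strings. The helper name `parse_version` is hypothetical and only wraps the expression shown in the diff.

```python
def parse_version(version):
    """Split a dotted version string into a tuple of ints.

    Same expression as in the diff; it assumes purely numeric components
    and would raise ValueError for suffixes such as "rc1".
    """
    return tuple(map(int, version.split(".")))


new = parse_version("0.0.15")
old = parse_version("0.0.9")

# Tuples compare numerically, component by component.
tuple_says_newer = new > old

# Plain strings compare character by character, so "0.0.15" sorts
# before "0.0.9" because "1" < "9".
string_says_newer = "0.0.15" > "0.0.9"
```

Comparing the tuples gives the expected ordering, while comparing the raw strings does not.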

pymupdf4llm/pymupdf4llm/helpers/get_text_lines.py

Lines changed: 11 additions & 5 deletions
@@ -34,7 +34,8 @@ def get_raw_lines(textpage, clip=None, tolerance=3):
     Result is a sorted list of line objects that consist of the recomputed line
     boundary box and the sorted list of spans in that line.
 
-    This result can then easily be converted e.g. to plain or markdown text.
+    This result can then easily be converted e.g. to plain text and other
+    formats like Markdown or JSON.
 
     Args:
         textpage: (mandatory) TextPage object
@@ -45,7 +46,7 @@ def get_raw_lines(textpage, clip=None, tolerance=3):
 
     Returns:
         A sorted list of items (rect, [spans]), each representing one line. The
-        spans are sorted left to right, Span dictionaries have been changed:
+        spans are sorted left to right. Span dictionaries have been changed:
         - "bbox" has been converted to a Rect object
         - "line" (new) the line number in TextPage.extractDICT
         - "block" (new) the block number in TextPage.extractDICT
@@ -98,7 +99,7 @@ def sanitize_spans(line):
     spans = [] # all spans in TextPage here
     for bno, b in enumerate(blocks): # the numbered blocks
         for lno, line in enumerate(b["lines"]): # the numbered lines
-            if abs(1-line["dir"][0]) > 1e-3: # only accept horizontal text
+            if abs(1 - line["dir"][0]) > 1e-3: # only accept horizontal text
                 continue
             for sno, s in enumerate(line["spans"]): # the numbered spans
                 sbbox = pymupdf.Rect(s["bbox"]) # span bbox as a Rect
@@ -131,7 +132,10 @@ def sanitize_spans(line):
         sbbox = s["bbox"] # this bbox
         sbbox0 = line[-1]["bbox"] # previous bbox
         # if any of top or bottom coordinates are close enough, join...
-        if abs(sbbox.y1 - sbbox0.y1) <= y_delta or abs(sbbox.y0 - sbbox0.y0) <= y_delta:
+        if (
+            abs(sbbox.y1 - sbbox0.y1) <= y_delta
+            or abs(sbbox.y0 - sbbox0.y0) <= y_delta
+        ):
             line.append(s) # append to this line
             lrect |= sbbox # extend line rectangle
             continue
@@ -152,7 +156,9 @@ def sanitize_spans(line):
     return nlines
 
 
-def get_text_lines(page, *, textpage=None, clip=None, sep="\t", tolerance=3, ocr=False):
+def get_text_lines(
+    page, *, textpage=None, clip=None, sep="\t", tolerance=3, ocr=False
+):
     """Extract text by line keeping natural reading sequence.
 
     Notes:
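The span-joining logic touched by the hunks above can be sketched without PyMuPDF. The version below is a simplified stand-in: `Box`, `is_horizontal` and `group_into_lines` are hypothetical names, a plain dataclass replaces `pymupdf.Rect`, and sorting the spans by bottom coordinate before grouping is an assumption about the preceding code, which the diff does not show.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Minimal stand-in for a span bounding box (not pymupdf.Rect)."""
    x0: float
    y0: float
    x1: float
    y1: float


def is_horizontal(direction, eps=1e-3):
    """Accept only text whose direction vector is (close to) (1, 0)."""
    return abs(1 - direction[0]) <= eps


def group_into_lines(boxes, y_delta=3):
    """Group boxes into lines by comparing top/bottom coordinates.

    A box joins the current line if its top OR bottom coordinate is within
    `y_delta` of the previously appended box, mirroring the condition in
    the diff. Each resulting line is then sorted left to right.
    """
    lines = []
    for box in sorted(boxes, key=lambda b: (b.y1, b.x0)):
        if lines:
            prev = lines[-1][-1]  # previously appended box of this line
            if (
                abs(box.y1 - prev.y1) <= y_delta
                or abs(box.y0 - prev.y0) <= y_delta
            ):
                lines[-1].append(box)
                continue
        lines.append([box])  # start a new line
    for line in lines:
        line.sort(key=lambda b: b.x0)
    return lines
```

Two boxes whose bottoms differ by one point end up on the same line, while a box about 30 points lower starts a new one.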
