Commit 57e66c4

Merge pull request #310 from pymupdf/README-update
Updates README.
2 parents b7e6e90 + aef485d commit 57e66c4

1 file changed: +26 -16 lines changed

README.md

Lines changed: 26 additions & 16 deletions
@@ -1,18 +1,28 @@
-# Using PyMuPDF in an RAG (Retrieval-Augmented Generation) Chatbot Environment
+# PyMuPDF4LLM

-This repository contains examples showing how PyMuPDF can be used as a data feed for RAG-based chatbots.
+PyMuPDF4LLM is a specialized extension of PyMuPDF designed specifically for extracting content from PDFs in a format that's optimized for Large Language Models (LLMs).

-Examples include scripts that start chatbots - either as simple CLI programs in REPL mode or browser-based GUIs.
-Chatbot scripts follow this general structure:
+## Key Features

-1. **Extract Text**: Use PyMuPDF to extract text from one or more pages from one or more PDFs. Depending on the specific requirement this may be all text or only text contained in tables, the Table of Contents, etc.
-This will generally be implemented as one or more Python functions called by any of the following events - which implement the actual chatbot functionality.
-2. **Indexing the Extracted Text**: Index the extracted text for efficient retrieval. This index will act as the knowledge base for the chatbot.
-3. **Query Processing**: When a user asks a question, process the query to determine the key information needed for a response.
-4. **Retrieving Relevant Information**: Search your indexed knowledge base for the most relevant pieces of information related to the user's query.
-5. **Generating a Response**: Use a generative model to generate a response based on the retrieved information.
+1. Markdown Output

-# Installation
+- Converts PDFs to clean, structured Markdown format
+- Preserves document hierarchy (headers, lists, tables)
+- Makes PDF content easily digestible for LLMs like Claude, GPT, etc.
+
+2. Intelligent Structure Detection
+
+- Automatically identifies headers, paragraphs, tables, and images
+- Maintains document layout and reading order
+- Preserves semantic structure
+
+3. Image Handling
+
+- Extracts images from PDFs
+- Can save images separately or encode them inline
+- Useful for multimodal LLMs that can process images
+
+## Installation

The Python package on PyPI [pymupdf4llm](https://pypi.org/project/pymupdf4llm/) (there also is an alias [pdf4llm](https://pypi.org/project/pdf4llm/)) is capable of converting PDF pages into **_text strings in Markdown format_** (GitHub compatible). This includes **standard text** as well as **table-based text** in a consistent and integrated view - a feature particularly important in RAG settings.

@@ -42,27 +52,27 @@ To create small **chunks of text** - as opposed to generating one large string f

Also new in version 0.0.2 is the optional **extraction of images** and vector graphics: use of parameter `write_images=True`. This will store PNG images in the document's folder, and the Markdown text will appropriately refer to them. The images are named like `"input.pdf-page_number-index.png"`.
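
The `write_images=True` option described above might be used as in the following minimal sketch. The call to `pymupdf4llm.to_markdown` follows the package's documented usage. `convert_with_images` and `expected_image_name` are hypothetical helpers added here for illustration only. The sketch assumes that the file name part of an image name is the document's base name, and it does not assume whether page numbers are 0- or 1-based, so the page number is passed in as given.

```python
from pathlib import Path


def convert_with_images(path: str) -> str:
    """Convert a document to Markdown and write its images as PNG files."""
    # Imported lazily so the naming helper below works without the package.
    import pymupdf4llm

    # write_images=True stores the images as PNG files and makes the
    # returned Markdown text refer to them.
    return pymupdf4llm.to_markdown(path, write_images=True)


def expected_image_name(path: str, page_number: int, index: int) -> str:
    """Hypothetical helper: build a name like "input.pdf-page_number-index.png".

    Assumption: only the base name of the document is used in the image name.
    """
    return f"{Path(path).name}-{page_number}-{index}.png"
```

Calling `convert_with_images("input.pdf")` requires the `pymupdf4llm` package to be installed and the file to exist; `expected_image_name` only builds a string and touches no files.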

-# Documentation and API
+## Documentation and API

[Documentation](https://pymupdf.readthedocs.io/en/latest/pymupdf4llm/index.html)

[API](https://pymupdf.readthedocs.io/en/latest/pymupdf4llm/api.html#pymupdf4llm-api)

-# Document Support
+## Document Support

While PDF is by far the most important document format worldwide, it is worthwhile mentioning that all examples and helper scripts work in the same way and **_without change_** for [all supported file types](https://pymupdf.readthedocs.io/en/latest/how-to-open-a-file.html#supported-file-types).

So for an XPS document or an eBook, simply provide the filename for instance as `"input.mobi"` and everything else will work as before.
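
To illustrate the point that only the filename changes, here is a small sketch that applies one and the same conversion call to files of different types. `convert_all` and `convert_documents` are hypothetical wrappers, not part of the package. The converter is passed in as a parameter so the loop can be tried with a stand-in function when `pymupdf4llm` is not installed.

```python
from typing import Callable, Dict, Iterable


def convert_all(paths: Iterable[str], convert: Callable[[str], str]) -> Dict[str, str]:
    """Apply the same converter to every file, whatever its type."""
    return {path: convert(path) for path in paths}


def convert_documents(paths: Iterable[str]) -> Dict[str, str]:
    """Convert documents of any supported type to Markdown text."""
    # Lazy import: the package is only needed for real conversions.
    import pymupdf4llm

    # The same call is used for PDF, XPS, and eBook files alike.
    return convert_all(paths, pymupdf4llm.to_markdown)
```

For example, `convert_documents(["input.pdf", "input.mobi"])` would return a dictionary that maps each filename to its Markdown text, provided both files exist.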


-# About PyMuPDF
+## About PyMuPDF
**PyMuPDF** adds **Python** bindings and abstractions to [MuPDF](https://mupdf.com/), a lightweight **PDF**, **XPS**, and **eBook** viewer, renderer, and toolkit. Both **PyMuPDF** and **MuPDF** are maintained and developed by [Artifex Software, Inc](https://artifex.com).

PyMuPDF's homepage is located on [GitHub](https://github.com/pymupdf/PyMuPDF).

-# Community
+## Community
Join us on **Discord** here: [#pymupdf](https://discord.gg/TSpYGBW4eq).

-# License and Copyright
+## License and Copyright
**PyMuPDF** is available under [open-source AGPL](https://www.gnu.org/licenses/agpl-3.0.html) and commercial license agreements. If you determine you cannot meet the requirements of the **AGPL**, please contact [Artifex](https://artifex.com/contact/pymupdf-inquiry.php) for more information regarding a commercial license.
