RAG should be an add-in for multipage PDF documents #133
-
Hello @doncat99! I'm finishing this today: advanced mapping, which will be able to extract beyond the context window and token count. Is this the exception you hit? I'll have that fixed in a few hours. For now, RAG will not be added. There are plenty of RAG libraries out there; this project will evolve into vertical agents.
-
Great job, @enoch3712! I will try it out and give you feedback ASAP.
-
I just made a temporary patch to get my 600-page PDF accepted by the OpenAI API: I cut the 600 pages into pieces and analyzed them from chapters down to sections. The benefit is that, with a complex multi-layer data model framework, the LLM can stay focused and not miss data. Nevertheless, I think advanced mapping will make this much easier and faster.
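A minimal sketch of that page-chunking workaround, assuming pypdf; the chunk size is a hypothetical knob, and the 1,048,576-character cap comes from the OpenAI error pasted later in this thread:

```python
from pypdf import PdfReader

MAX_CHARS = 1_048_576   # OpenAI's per-message string limit, per the error below
PAGES_PER_CHUNK = 20    # hypothetical chunk size; tune per document

def chunk_pdf_text(path: str) -> list[str]:
    """Split a large PDF into text chunks that stay under the API limit."""
    reader = PdfReader(path)
    chunks: list[str] = []
    current = ""
    for i, page in enumerate(reader.pages):
        text = page.extract_text() or ""
        # Close the chunk at a page boundary before the limit is exceeded
        if current and (len(current) + len(text) > MAX_CHARS
                        or i % PAGES_PER_CHUNK == 0):
            chunks.append(current)
            current = ""
        current += text
    if current:
        chunks.append(current)
    return chunks

# Each chunk can then be analyzed separately (chapter -> section).
```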
-
@doncat99 Yes, thank you! I do exactly the same with the PAGING strategy. The problem now is with similar fields, where I need to resolve the ambiguous ones. It works, but I'm still testing; I'll ping you once it's pushed.
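This is not ExtractThinker's actual merge logic, just a hypothetical sketch of the "similar fields" problem: the same field extracted on several pages, disambiguated here by majority vote:

```python
from collections import Counter
from typing import Any, Dict, List

def merge_page_results(pages: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Merge field dicts extracted per page, keeping the most frequent value."""
    merged: Dict[str, Any] = {}
    for field in {key for page in pages for key in page}:
        values = [page[field] for page in pages if page.get(field) is not None]
        if values:
            # Majority vote across pages resolves ambiguous duplicates
            merged[field] = Counter(map(str, values)).most_common(1)[0][0]
    return merged
```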
-
Hey @doncat99, sorry for the delay, but I pulled it off after a couple of good tests. As you can see it's a big one, but it has a problem that I'm going to solve now: documentLoader.load is not universal, so I need to take care of that. It works with Tesseract and a couple of others, but not with loaders like pypdf.
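A hypothetical shim for that non-universal documentLoader.load output; the shapes handled here are assumptions about what different loaders might return, not the library's real types:

```python
from typing import Any, List

def normalize_loaded(content: Any) -> List[str]:
    """Coerce whatever a DocumentLoader returned into a list of page strings."""
    if isinstance(content, str):
        return [content]
    if isinstance(content, dict):
        return [str(value) for value in content.values()]
    if isinstance(content, list):
        return [item if isinstance(item, str) else str(item) for item in content]
    raise TypeError(f"Unsupported loader output: {type(content)!r}")
```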
-
I merged it (not published yet), but it's here: 87-advanced-mapping. I need everything in main so I can make the changes you pointed out easily.
-
Wow, great progress! Yesterday I was wondering whether the extractor could retrieve all the images and diagrams from a PDF file; then I could call up an image model to analyse them. I'm still working on it, but I'm confused about how to write the correct Contract data model, and directly setting vision mode just messes everything up.
-
https://enoch3712.github.io/ExtractThinker/examples/image-chart-processing/ What do you think about this? But yes, that's the way :) make that abstraction with images.
-
There are two approaches to completing the task (image retrieval). One is a top-down approach, as you kindly shared in the example: use an image loader to analyze every page image. The advantage is that it is universal for the majority of cases; the weakness is that it costs a lot (time and tokens). Besides, it is easier to lose focus and extract irrelevant images (I have not tested this yet, but in my experience it depends heavily on the PDF content itself; the content may mislead the model no matter how well the prompt is optimized). The other is a bottom-up approach: analyze the document coarse-to-fine. What I am doing is adding a field to DocumentContract, ChapterContract, or even SectionContract to ask the text model whether the page contains a target image (pie chart, bar chart, or other), and then starting a background task that brings in the image model to do its work. The weakness is that this approach is strongly task-oriented.
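A minimal sketch of that bottom-up flag, assuming ExtractThinker's Contract base class (a pydantic model); the field names and the schedule_vision_analysis hook are hypothetical illustrations, not part of the library:

```python
from typing import List, Optional
from extract_thinker import Contract

class SectionContract(Contract):
    title: str
    body: str
    # The text model answers a cheap yes/no question per section
    contains_chart: bool = False
    chart_kind: Optional[str] = None  # e.g. "pie", "bar", "line"

class ChapterContract(Contract):
    title: str
    sections: List[SectionContract]

def schedule_vision_analysis(chapter: str, section: str) -> None:
    # Placeholder for the background image-model task
    print(f"vision task queued: {chapter} / {section}")

def dispatch_image_tasks(chapters: List[ChapterContract]) -> None:
    """Run the image model only on the sections the text model flagged."""
    for chapter in chapters:
        for section in chapter.sections:
            if section.contains_chart:
                schedule_vision_analysis(chapter.title, section.title)
```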
-
This is a great idea, and I already have similar approaches (tree classification). I never did it because the models I tend to use don't struggle with these issues, since they are usually SOTA models, but it makes perfect sense. I'm going to add that as a feature, like you described. It could be a strategy that you pick.
-
OK! I really love this implementation! Tell me what you think: in a later iteration I will eventually be able to split out and pass dedicated images rather than the entire page. This was done in Docling; I may be able to implement something similar here.
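Not the Docling implementation, just a minimal sketch of the same idea, assuming pypdf 3.x's page.images:

```python
from pypdf import PdfReader

def extract_embedded_images(path: str, out_dir: str = ".") -> list[str]:
    """Save each embedded image on its own so a vision model sees it in isolation."""
    reader = PdfReader(path)
    saved: list[str] = []
    for page_no, page in enumerate(reader.pages):
        for image in page.images:  # embedded image XObjects on the page
            name = f"{out_dir}/p{page_no}_{image.name}"
            with open(name, "wb") as fh:
                fh.write(image.data)
            saved.append(name)
    return saved
```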
-
I see a bright future for this project. The top-level design may evolve from SOTA-model extraction into an intelligent task-flow pipeline. By declaring a (data model -> to-do) relationship, such as an image extracted from a contract model, a sub-task would call up the image model to have the image summarized (predefined action 1), regenerated (predefined action 2), etc. This requires a new description framework, followed by a capability declaration schema. Balancing user customization against model self-understanding is the big design question for that framework and schema.

****** OpenAI's understanding ******

Key idea: introduce a relationship between the data model (e.g., an extracted entity) and a "to-do" task.
- Data model: represents the extracted data entities or assets (e.g., images, graphs, sections of text).
- Capability declaration: each model or tool declares what actions it supports and the inputs/outputs it expects.
- Matching: match the required action (e.g., "summarize") to the model that declares support for it.
- User customization: allow the user to define tasks, workflows, and actions explicitly, providing maximum flexibility through a clear framework for defining tasks and relationships.
- Inference: if the system sees an image extracted, it could infer the corresponding task.
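A hypothetical Python sketch of that capability declaration and matching step; Capability, ActionRequest, and route are illustrative names, not an existing ExtractThinker API:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Capability:
    model_name: str         # e.g. "vision-model" for image work
    actions: List[str]      # e.g. ["summarize", "regenerate"]
    input_type: str         # e.g. "image"
    output_type: str        # e.g. "text"

@dataclass
class ActionRequest:
    asset_type: str         # what the data model produced, e.g. "image"
    action: str             # the to-do, e.g. "summarize"

def route(request: ActionRequest, registry: List[Capability]) -> Capability:
    """Match a required action to the first model that declares support for it."""
    for cap in registry:
        if request.action in cap.actions and cap.input_type == request.asset_type:
            return cap
    raise LookupError(f"No model declares {request.action} on {request.asset_type}")

# Example: an image extracted from a contract triggers a summarize sub-task
registry = [Capability("vision-model", ["summarize", "regenerate"], "image", "text")]
print(route(ActionRequest("image", "summarize"), registry).model_name)
```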
-
Sorry for the delay, @doncat99. Just launched v0.0.28; very close to a beta release at this point. I will close this because it was attached to a bug, but we can keep discussing it here. #132
-
```
instructor.exceptions.InstructorRetryException: litellm.BadRequestError: litellm.ContextWindowExceededError: ContextWindowExceededError: OpenAIException - Error code: 400 - {'error': {'message': "openai error: Invalid 'messages[1].content': string too long. Expected a string with maximum length 1048576, but got a string with length 1976458 instead. (request id: 2024121114571845225027487090505) (request id: 2024121114572763326430369052749) (request id: 2024121114572718113572229234539) (request id: 2024121114572676525390052321699) (request id: 202412111457268173112210109997) (request id: 2024121114572460819032852968468)", 'type': 'invalid_request_error', 'param': 'messages[1].content', 'code': 'string_above_max_length'}}
```