RAG should be an add-in for multipage PDF documents #133
-
Hello @doncat99! I'm finishing this today: advanced mapping, which will be able to extract beyond the context window and token count. Is this the exception you hit? I'll have that fixed in a few hours. For now, RAG will not be added. There are plenty of RAG libraries out there; this project will evolve into vertical agents.
-
Great job, @enoch3712! I will try it out and give you feedback ASAP.
-
I just made a temporary patch to get my 600-page PDF accepted by the OpenAI API: I cut the 600 pages into pieces and analyzed them from chapters down to sections. The benefit is that, with a complex multi-layer data model framework, the LLM can stay focused and not miss data. Nevertheless, I think advanced mapping will make this much easier and faster.
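A minimal sketch of that page-chunking workaround, assuming pypdf; the chunk size is a hypothetical knob, and the 1,048,576-character cap comes from the OpenAI error pasted later in this thread:

```python
from pypdf import PdfReader

MAX_CHARS = 1_048_576   # OpenAI's per-message string limit, per the error below
PAGES_PER_CHUNK = 20    # hypothetical chunk size; tune per document

def chunk_pdf_text(path: str) -> list[str]:
    """Split a large PDF into text chunks that stay under the API limit."""
    reader = PdfReader(path)
    chunks: list[str] = []
    current = ""
    for i, page in enumerate(reader.pages):
        text = page.extract_text() or ""
        # Close the chunk at a page boundary before the limit is exceeded
        if current and (len(current) + len(text) > MAX_CHARS
                        or i % PAGES_PER_CHUNK == 0):
            chunks.append(current)
            current = ""
        current += text
    if current:
        chunks.append(current)
    return chunks

# Each chunk can then be analyzed separately (chapter -> section).
```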
-
@doncat99 Yes, thank you! I do exactly the same with the PAGING strategy. The problem now is with similar fields, where I need to resolve the ambiguous ones. It works, but I'm still testing; I'll ping you once it's pushed.
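This is not ExtractThinker's actual merge logic, just a hypothetical sketch of the "similar fields" problem: the same field extracted on several pages, disambiguated here by majority vote:

```python
from collections import Counter
from typing import Any, Dict, List

def merge_page_results(pages: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Merge field dicts extracted per page, keeping the most frequent value."""
    merged: Dict[str, Any] = {}
    for field in {key for page in pages for key in page}:
        values = [page[field] for page in pages if page.get(field) is not None]
        if values:
            # Majority vote across pages resolves ambiguous duplicates
            merged[field] = Counter(map(str, values)).most_common(1)[0][0]
    return merged
```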
-
Hey @doncat99, sorry for the delay, but I pulled it off after a couple of good tests. As you can see it's a big one, but it has a problem that I'm going to solve now: documentLoader.load is not universal, so I need to take care of that. It works with Tesseract and a couple of others, but not with loaders like pypdf.
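A hypothetical shim for that non-universal documentLoader.load output; the shapes handled here are assumptions about what different loaders might return, not the library's real types:

```python
from typing import Any, List

def normalize_loaded(content: Any) -> List[str]:
    """Coerce whatever a DocumentLoader returned into a list of page strings."""
    if isinstance(content, str):
        return [content]
    if isinstance(content, dict):
        return [str(value) for value in content.values()]
    if isinstance(content, list):
        return [item if isinstance(item, str) else str(item) for item in content]
    raise TypeError(f"Unsupported loader output: {type(content)!r}")
```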
-
I merged it (not published yet), but it's here: 87-advanced-mapping. I need everything in main so I can make the changes you pointed out easily.
-
Wow, great progress! Yesterday I was wondering whether the extractor could retrieve all the images and diagrams from a PDF file; then I could call up an image model to analyse them. I'm still working on it, but I'm confused about how to write the correct Contract data model, and directly setting vision mode just messes everything up.
-
https://enoch3712.github.io/ExtractThinker/examples/image-chart-processing/ What do you think about this? But yes, that's the way :) make that abstraction with images.
-
There are two approaches to completing the task (image retrieval). One is a top-down approach, as you kindly shared in the example: use an image loader to analyze every page image. The advantage is that it is universal for the majority of cases; the weakness is that it costs a lot (time and tokens). Besides, it is easier to lose focus and extract irrelevant images (I have not tested this yet, but in my experience it depends heavily on the PDF content itself; the content may mislead the model no matter how well the prompt is optimized). The other is a bottom-up approach: analyze the document coarse-to-fine. What I am doing is adding a field to DocumentContract, ChapterContract, or even SectionContract to ask the text model whether the page contains a target image (pie chart, bar chart, or other), and then starting a background task that brings in the image model to do its work. The weakness is that this approach is strongly task-oriented.
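A minimal sketch of that bottom-up flag, assuming ExtractThinker's Contract base class (a pydantic model); the field names and the schedule_vision_analysis hook are hypothetical illustrations, not part of the library:

```python
from typing import List, Optional
from extract_thinker import Contract

class SectionContract(Contract):
    title: str
    body: str
    # The text model answers a cheap yes/no question per section
    contains_chart: bool = False
    chart_kind: Optional[str] = None  # e.g. "pie", "bar", "line"

class ChapterContract(Contract):
    title: str
    sections: List[SectionContract]

def schedule_vision_analysis(chapter: str, section: str) -> None:
    # Placeholder for the background image-model task
    print(f"vision task queued: {chapter} / {section}")

def dispatch_image_tasks(chapters: List[ChapterContract]) -> None:
    """Run the image model only on the sections the text model flagged."""
    for chapter in chapters:
        for section in chapter.sections:
            if section.contains_chart:
                schedule_vision_analysis(chapter.title, section.title)
```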
-
This is a great idea, and I already have similar approaches (tree classification). I never did it because the models I tend to use don't struggle with these issues, since they are usually SOTA models, but it makes perfect sense. I'm going to add that as a feature, like you described. It could be a strategy that you pick.
-
OK! I really love this implementation! Tell me what you think: in a later iteration I will eventually be able to split out and pass dedicated images rather than the entire page. This was done in Docling; I may be able to implement something similar here.
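Not the Docling implementation, just a minimal sketch of the same idea, assuming pypdf 3.x's page.images:

```python
from pypdf import PdfReader

def extract_embedded_images(path: str, out_dir: str = ".") -> list[str]:
    """Save each embedded image on its own so a vision model sees it in isolation."""
    reader = PdfReader(path)
    saved: list[str] = []
    for page_no, page in enumerate(reader.pages):
        for image in page.images:  # embedded image XObjects on the page
            name = f"{out_dir}/p{page_no}_{image.name}"
            with open(name, "wb") as fh:
                fh.write(image.data)
            saved.append(name)
    return saved
```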
-
I see a bright future for this project. The top-level design may evolve from SOTA-model extraction into an intelligent task-flow pipeline. By declaring a (data model -> to-do) relationship, such as an image extracted from a contract model, a sub-task would call up the image model to have the image summarized (predefined action 1), regenerated (predefined action 2), etc. This requires a new description framework, followed by a capability declaration schema. Balancing user customization against model self-understanding is the big design question for that framework and schema.

****** OpenAI's understanding ******

Key idea: introduce a relationship between the data model (e.g., an extracted entity) and a "to-do" task.
- Data model: represents the extracted data entities or assets (e.g., images, graphs, sections of text).
- Capability declaration: each model or tool declares what actions it supports and the inputs/outputs it expects.
- Matching: match the required action (e.g., "summarize") to the model that declares support for it.
- User customization: allow the user to define tasks, workflows, and actions explicitly, providing maximum flexibility through a clear framework for defining tasks and relationships.
- Inference: if the system sees an image extracted, it could infer the corresponding task.
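A hypothetical Python sketch of that capability declaration and matching step; Capability, ActionRequest, and route are illustrative names, not an existing ExtractThinker API:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Capability:
    model_name: str         # e.g. "vision-model" for image work
    actions: List[str]      # e.g. ["summarize", "regenerate"]
    input_type: str         # e.g. "image"
    output_type: str        # e.g. "text"

@dataclass
class ActionRequest:
    asset_type: str         # what the data model produced, e.g. "image"
    action: str             # the to-do, e.g. "summarize"

def route(request: ActionRequest, registry: List[Capability]) -> Capability:
    """Match a required action to the first model that declares support for it."""
    for cap in registry:
        if request.action in cap.actions and cap.input_type == request.asset_type:
            return cap
    raise LookupError(f"No model declares {request.action} on {request.asset_type}")

# Example: an image extracted from a contract triggers a summarize sub-task
registry = [Capability("vision-model", ["summarize", "regenerate"], "image", "text")]
print(route(ActionRequest("image", "summarize"), registry).model_name)
```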
-
Sorry for the delay, @doncat99. Just launched v0.0.28; very close to a beta release at this point. I will close this because it was attached to a bug, but we can keep discussing it here. #132
-
```
instructor.exceptions.InstructorRetryException: litellm.BadRequestError: litellm.ContextWindowExceededError: ContextWindowExceededError: OpenAIException - Error code: 400 - {'error': {'message': "openai error: Invalid 'messages[1].content': string too long. Expected a string with maximum length 1048576, but got a string with length 1976458 instead. (request id: 2024121114571845225027487090505) (request id: 2024121114572763326430369052749) (request id: 2024121114572718113572229234539) (request id: 2024121114572676525390052321699) (request id: 202412111457268173112210109997) (request id: 2024121114572460819032852968468)", 'type': 'invalid_request_error', 'param': 'messages[1].content', 'code': 'string_above_max_length'}}
```