Replies: 1 comment
-
Hello @mophilly! Sorry, I was out for a day to rest and close some issues. I strongly discourage fine-tuning/training smaller models for this, for the simple reason that bigger models will always do the trick, including quantized models with layers removed. One reason is simply that the bang for the buck isn't big enough, and eventually I will release models and a curated dataset for this. I'm a bit lost on the goal here; I think we should take this to a meeting to see what the need really is. The mapping situation and "Identify 20 source documents that represent a reasonable set of variants, the 'training set'" can be necessary, but that's called "few-shot learning", and there are other prompt-engineering techniques that should fix the problem. I still don't have them built in, but I'm pretty sure they can fix it.
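To make the few-shot suggestion concrete, here is a minimal sketch of what few-shot prompting looks like for extraction: worked document-to-JSON examples are placed in the prompt before the new document, so no training is needed. The invoice fields and example documents are purely illustrative assumptions, not part of ExtractThinker.

```python
import json

# Hypothetical few-shot examples: pairs of raw document text and the
# structured JSON output you want the model to imitate.
FEW_SHOT_EXAMPLES = [
    {
        "document": "Invoice #1001\nVendor: Acme Corp\nTotal: $250.00",
        "output": {"invoice_number": "1001", "vendor": "Acme Corp", "total": 250.00},
    },
    {
        "document": "INV 2002 / Globex / amount due 99.50 USD",
        "output": {"invoice_number": "2002", "vendor": "Globex", "total": 99.50},
    },
]

def build_few_shot_prompt(new_document: str) -> str:
    """Assemble a prompt that shows the model worked examples before
    asking it to extract the new document."""
    parts = ["Extract the fields below as JSON.\n"]
    for ex in FEW_SHOT_EXAMPLES:
        parts.append(f"Document:\n{ex['document']}\nJSON:\n{json.dumps(ex['output'])}\n")
    parts.append(f"Document:\n{new_document}\nJSON:")
    return "\n".join(parts)
```

With around 20 representative variants, a handful of the trickiest ones in the prompt often covers the rest, which is why this usually beats fine-tuning on cost.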
-
The short question: What is the process to train a model for my specific set of document variants?
The long version...
My first effort with ExtractThinker produced a correct JSON result from a PDF, save for one minor error that is not a showstopper at this point. To achieve this, I designed the Pydantic classes to mirror the structure of the source document. With that success, I thought I was ready to begin classification of the various source documents.
However, the Pydantic classes I created do not conform to the data model used in the target DBMS, and mapping fields quickly became messy. After a bit of study, I have come to believe that the Pydantic classes should follow the target data model instead, and that the model should then be trained to recognize the various source documents and extract data that conforms to that data model.
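A minimal sketch of what "Pydantic classes follow the target data model" could look like: field names mirror the DBMS columns, and field descriptions tell the model where to find each value regardless of the source layout, so the mapping happens during extraction rather than afterwards. The `InvoiceRecord` name and its fields are hypothetical, not from ExtractThinker or any particular schema.

```python
from pydantic import BaseModel, Field

# Hypothetical target-schema contract: attribute names match the DBMS
# columns, and descriptions guide the model across layout variants.
class InvoiceRecord(BaseModel):
    invoice_number: str = Field(
        description="Invoice or reference number, however it is labeled in the document"
    )
    vendor_name: str = Field(description="Name of the issuing company")
    total_amount: float = Field(
        description="Grand total as a number, no currency symbol"
    )
```

This way there is one set of classes for the whole pipeline, and source-document variation is absorbed by the descriptions instead of by hand-written mapping code.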
With this in mind, I plan to train it using a set of source documents that the document owners consider confidential. Toward that end, I have set up Ollama and loaded the llama3-8B model. I have space for a larger model, and all of this experimentation is running on an Apple M1 system.
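For reference, a local Ollama install exposes an HTTP API on port 11434 by default, and its `/api/generate` endpoint accepts a `format: "json"` option that constrains output to valid JSON, which helps with structured extraction. The sketch below only builds the request (it does not send it), so it runs without a live server; the model tag and prompt are assumptions.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def make_extraction_request(prompt: str, model: str = "llama3:8b") -> urllib.request.Request:
    """Build (but do not send) a request to the local Ollama server.
    'format': 'json' asks Ollama to emit valid JSON only."""
    payload = {
        "model": model,
        "prompt": prompt,
        "format": "json",
        "stream": False,  # return one complete response instead of a token stream
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
```

Since everything stays on the M1 machine, the confidential documents never leave the local network, which is the main advantage of this setup over a hosted API.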
Then I will run the training set again to verify correct results, and repeat until happiness is achieved, adding validation detail at each iteration.
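The verify-and-repeat loop can be made measurable with a small harness that diffs each extraction against a hand-checked expected record and reports a pass rate, so each iteration's improvement is visible. This is a generic sketch, not an ExtractThinker feature; the field names are whatever your target schema uses.

```python
def diff_extraction(expected: dict, actual: dict) -> list[str]:
    """Return human-readable mismatches between the expected record
    and what the model actually extracted."""
    problems = []
    for key, want in expected.items():
        got = actual.get(key)
        if got != want:
            problems.append(f"{key}: expected {want!r}, got {got!r}")
    return problems

def score_training_set(cases: list[tuple[dict, dict]]) -> float:
    """Fraction of (expected, actual) pairs extracted with no mismatches."""
    if not cases:
        return 0.0
    passed = sum(1 for exp, act in cases if not diff_extraction(exp, act))
    return passed / len(cases)
```

"Happiness is achieved" then becomes a concrete threshold, e.g. a pass rate of 1.0 over the 20-document set, and the mismatch list tells you which validation detail to add next.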
At that point I think I will be ready to implement ExtractThinker classification.
Is this a good plan?
Is there a more efficient method or approach?