Replies: 1 comment
-
Hello @mophilly! Sorry, I was out for a day to rest and close some issues. I strongly discourage fine-tuning/training smaller models for this, for the simple reason that bigger models will always do the trick, including quantized models with layers removed. One reason is simply that the bang for the buck isn't big enough, and eventually I will release models and a curated dataset for this. I'm a bit lost on the goal here; I think we should take this to a meeting to see what the need really is. The mapping situation and "Identify 20 source documents that represent a reasonable set of variants, the 'training set'" can be necessary, but that's called "few-shot learning", and there are other prompt-engineering techniques that should fix the problem. I still don't have them built in, but I'm pretty sure they can fix it.
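To make the few-shot suggestion concrete, here is a minimal sketch of what few-shot prompting looks like for extraction: worked document-to-JSON examples are placed in the prompt before the new document, so no training is needed. The invoice fields and example documents are purely illustrative assumptions, not part of ExtractThinker.

```python
import json

# Hypothetical few-shot examples: pairs of raw document text and the
# structured JSON output you want the model to imitate.
FEW_SHOT_EXAMPLES = [
    {
        "document": "Invoice #1001\nVendor: Acme Corp\nTotal: $250.00",
        "output": {"invoice_number": "1001", "vendor": "Acme Corp", "total": 250.00},
    },
    {
        "document": "INV 2002 / Globex / amount due 99.50 USD",
        "output": {"invoice_number": "2002", "vendor": "Globex", "total": 99.50},
    },
]

def build_few_shot_prompt(new_document: str) -> str:
    """Assemble a prompt that shows the model worked examples before
    asking it to extract the new document."""
    parts = ["Extract the fields below as JSON.\n"]
    for ex in FEW_SHOT_EXAMPLES:
        parts.append(f"Document:\n{ex['document']}\nJSON:\n{json.dumps(ex['output'])}\n")
    parts.append(f"Document:\n{new_document}\nJSON:")
    return "\n".join(parts)
```

With around 20 representative variants, a handful of the trickiest ones in the prompt often covers the rest, which is why this usually beats fine-tuning on cost.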
-
The short question: What is the process to train a model for my specific set of document variants?
The long version...
My first effort with ExtractThinker produced a correct JSON result from a PDF, save for one minor error that is not a showstopper at this point. To achieve this, I designed the Pydantic classes to mirror the structure of the source document. With that success, I thought I was ready to begin classification of the various source documents.
However, the Pydantic classes I created do not conform to the data model used in the target DBMS, and mapping fields quickly became messy. After a bit of study, I have come to believe that the Pydantic classes should follow the target data model instead, and that the model should then be trained to recognize the various source documents and extract data that conforms to that data model.
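A minimal sketch of what "Pydantic classes follow the target data model" could look like: field names mirror the DBMS columns, and field descriptions tell the model where to find each value regardless of the source layout, so the mapping happens during extraction rather than afterwards. The `InvoiceRecord` name and its fields are hypothetical, not from ExtractThinker or any particular schema.

```python
from pydantic import BaseModel, Field

# Hypothetical target-schema contract: attribute names match the DBMS
# columns, and descriptions guide the model across layout variants.
class InvoiceRecord(BaseModel):
    invoice_number: str = Field(
        description="Invoice or reference number, however it is labeled in the document"
    )
    vendor_name: str = Field(description="Name of the issuing company")
    total_amount: float = Field(
        description="Grand total as a number, no currency symbol"
    )
```

This way there is one set of classes for the whole pipeline, and source-document variation is absorbed by the descriptions instead of by hand-written mapping code.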
With this in mind, I plan to train it using a set of source documents that the document owners consider confidential. Toward that end, I have set up Ollama and loaded the llama3-8B model. I have space for a larger model, and all of this experimentation is running on an Apple M1 system.
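For reference, a local Ollama install exposes an HTTP API on port 11434 by default, and its `/api/generate` endpoint accepts a `format: "json"` option that constrains output to valid JSON, which helps with structured extraction. The sketch below only builds the request (it does not send it), so it runs without a live server; the model tag and prompt are assumptions.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def make_extraction_request(prompt: str, model: str = "llama3:8b") -> urllib.request.Request:
    """Build (but do not send) a request to the local Ollama server.
    'format': 'json' asks Ollama to emit valid JSON only."""
    payload = {
        "model": model,
        "prompt": prompt,
        "format": "json",
        "stream": False,  # return one complete response instead of a token stream
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
```

Since everything stays on the M1 machine, the confidential documents never leave the local network, which is the main advantage of this setup over a hosted API.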
Then I will run the training set again to verify correct results, and repeat until happiness is achieved, adding validation detail at each iteration.
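The verify-and-repeat loop can be made measurable with a small harness that diffs each extraction against a hand-checked expected record and reports a pass rate, so each iteration's improvement is visible. This is a generic sketch, not an ExtractThinker feature; the field names are whatever your target schema uses.

```python
def diff_extraction(expected: dict, actual: dict) -> list[str]:
    """Return human-readable mismatches between the expected record
    and what the model actually extracted."""
    problems = []
    for key, want in expected.items():
        got = actual.get(key)
        if got != want:
            problems.append(f"{key}: expected {want!r}, got {got!r}")
    return problems

def score_training_set(cases: list[tuple[dict, dict]]) -> float:
    """Fraction of (expected, actual) pairs extracted with no mismatches."""
    if not cases:
        return 0.0
    passed = sum(1 for exp, act in cases if not diff_extraction(exp, act))
    return passed / len(cases)
```

"Happiness is achieved" then becomes a concrete threshold, e.g. a pass rate of 1.0 over the 20-document set, and the mismatch list tells you which validation detail to add next.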
At that point I think I will be ready to implement ExtractThinker classification.
Is this a good plan?
Is there a more efficient method or approach?