Examples for E2E Mistral training & deployment #673
base: main
Conversation
Coming soon to AWS Machine Learning blog
Overall it is quite different from our examples, which are the equivalent of the examples in Transformers.
Should we create a directory like "aws-examples" or something where we would put them?
Is it the run_clm.py script we already provide?
The distinction is in dataset loading: it uses load_from_disk to load a saved DatasetDict before downloading from the HF Hub.
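That load-from-disk-first pattern could be sketched roughly like this (a hypothetical helper, not the actual script; the directory check and fallback logic are assumptions):

```python
import os


def get_dataset(path_or_id: str):
    """Load a DatasetDict saved with save_to_disk if the argument is a
    local directory; otherwise treat it as a Hub dataset id."""
    # Imported lazily so the helper has no hard dependency at module
    # import time.
    from datasets import load_dataset, load_from_disk

    if os.path.isdir(path_or_id):
        # A preprocessed DatasetDict was saved here earlier.
        return load_from_disk(path_or_id)
    # Fall back to downloading from the Hugging Face Hub.
    return load_dataset(path_or_id)
```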
Alright, I do not think this is really needed as a distinction. But that's fair.
Yes, we should rather update the general one to make it support loading datasets from disk.
Honestly I think we should keep examples easy. Saving the dataset on disk is not really needed. There is a caching mechanism, we should just push them to use the dataset id and let the Datasets library handle the rest.
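The suggestion here — pass the Hub dataset id and let the Datasets cache do the work — would look roughly like the following sketch (the gsm8k config name "main" comes from the Hub; this is not the example's actual code):

```python
def load_gsm8k():
    # The first call downloads from the Hugging Face Hub; subsequent
    # calls are served from the local cache (by default under
    # ~/.cache/huggingface/datasets), so no manual save_to_disk step
    # is needed.
    from datasets import load_dataset

    return load_dataset("gsm8k", "main")
```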
In our example, gsm8k is not pre-formatted for run_clm.py, which uses a single text column for training data, so the dataset needs to be formatted.
The reason we're loading from disk is to show customers (in the blog post) how the dataset needs to be formatted by leading them through it step-by-step (which ends up being dataset.py). While we can modify the training script to shape the dataset as necessary, we thought it'd be easier to show it in a standalone file - for the purposes of our blog post, run_clm.py isn't discussed in detail (the idea being it's plug and play with any LLM or dataset) so simplicity isn't necessarily our concern there.
Technically speaking, we could modify run_clm.py to shape the data itself, but I'd also argue it's better to have a training script that can take in any pre-formatted dataset, rather than having to modify it every time for new datasets. Hope this helps you see my point of view.
@pagezyhf @philschmid wdyt?
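The step-by-step formatting described above (what ends up as dataset.py) might look something like this sketch. The column names "question" and "answer" reflect the public gsm8k dataset, but the concrete prompt template is an assumption, not the blog's actual code:

```python
def format_example(example: dict) -> dict:
    # gsm8k rows carry "question" and "answer" columns; run_clm.py
    # trains on a single text column, so concatenate them into one.
    text = (
        "### Question\n" + example["question"]
        + "\n\n### Answer\n" + example["answer"]
    )
    return {"text": text}


def build_dataset(save_path: str = "./formatted_gsm8k"):
    # Download gsm8k, map every row to the single "text" column, and
    # save the result so the training script can load it from disk.
    from datasets import load_dataset

    ds = load_dataset("gsm8k", "main")
    formatted = ds.map(format_example, remove_columns=["question", "answer"])
    formatted.save_to_disk(save_path)
    return formatted
```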
Thank you for opening the PR. I'm missing some context: what is the goal here? I cannot see instructions on what to do with the YAML file or how the scripts are used. If we want to make this an official example, we should make sure it's understandable and usable for every user.
So this is for an AWS Machine Learning blog, to give beginners and intermediates an idea of how to use AWS Trainium and AWS Inferentia2 for simple end-to-end training and inference. The YAML file is an Amazon CloudFormation template - it doesn't necessarily need to be included, so I can take that out. Overall, similar to the other examples in this directory, we (will) have documentation elsewhere that refers to these examples. Michael suggested we might create an aws-examples directory, which would be fine as well.
I would say let's make the tutorial as minimalist as possible and put it in an aws-examples directory.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Thank you!
What does this PR do?
Adds examples that are used in an upcoming AWS Machine Learning blog post.