Autoresume Validation with Max Duration #3358
Do we want to keep the incrementing behavior? If I load a 10ba checkpoint with `Trainer(..., max_duration='15ba')` and then call `fit(duration='12ba')`, wouldn't I want it to train for 2 batches instead of 5? Shouldn't what I specify in `fit` override what I passed to the constructor?
IIRC the original behavior was modeled after what other training libraries do... but I agree it's perhaps not ideal. We should revisit this in a follow-on PR.
Yeah, we should. This is very counterintuitive.
@antoinebrl curious if you have any input, given your recent GitHub issue referencing multiple `fit` calls.
I must say this was a very counterintuitive bug to track down. This PR implements some fences which push the user towards the expected behavior, but I agree the notion of training duration can be improved. Regarding the multiple calls to …
This reverts commit f0eae8a.
Revert "Autoresume Validation with Max Duration (#3358)"
This reverts commit f0eae8a.
* add bump
* bump deepspeed
* revert deepspeed bump
* add state dict
Co-authored-by: Saaketh Narayan <[email protected]>
What does this PR do?
Additional checks based on #3357:
- … `duration` in `fit`. Otherwise, Composer will autoresume from the last checkpoint, and then the specified value of `max_duration` from `fit` will be incremented to the loaded max duration. In short, `fit(duration=...)` is not idempotent with autoresume.
- `reset_time` in `fit` does not work with autoresume, as it will force a restart in training. Instead, time should be ignored on load, which is used in the initial load and not on the autoresume load.
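
To make the non-idempotence concrete, here is a minimal sketch in plain Python. It is a toy model, not Composer's implementation: `stop_batch` is a hypothetical helper, and it assumes the duration passed to `fit` is added on top of the batch count restored from the checkpoint.

```python
def stop_batch(resumed_at: int, fit_duration: int, incremental: bool = True) -> int:
    """Batch index at which a hypothetical trainer would stop.

    incremental=True  -> fit_duration is added to the restored progress
                         (the behavior this PR guards against with autoresume).
    incremental=False -> fit_duration is treated as an absolute target.
    """
    if incremental:
        return resumed_at + fit_duration
    # An absolute target can never move training backwards.
    return max(resumed_at, fit_duration)


# Fresh run: nothing restored, so the run stops at batch 12.
fresh = stop_batch(resumed_at=0, fit_duration=12)

# Same script relaunched after a checkpoint at batch 10 (autoresume):
# the same call now stops at batch 22 instead of 12.
resumed = stop_batch(resumed_at=10, fit_duration=12)

# With an absolute target, the relaunch stops where the fresh run would have.
absolute = stop_batch(resumed_at=10, fit_duration=12, incremental=False)

print(fresh, resumed, absolute)  # 12 22 12
```

Under this assumption, rerunning an unchanged script after a restart changes the total amount of training, which is why the checks above steer autoresume users away from `duration` in `fit`.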