Autoresume and duration mismatch on reload #3357
Comments
@antoinebrl what if you set |
Your approach might solve the issue with the scheduler, which is how I identified the issue. However, the training still goes for longer than expected. For example, I do |
Yep... good point. I think it might also break on multiple resumptions. I will add a PR to at least require max_duration on init. |
Awesome! Thanks for tackling it 💪 As a quick fix on my side, I added a couple of exceptions with instructions to guide users toward the right path:
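A guard of that shape might look roughly like the following. This is a minimal, dependency-free sketch, not the actual snippet from the thread: the function name, arguments, and error messages are all illustrative, and the arguments simply mirror the `Trainer` settings discussed above.

```python
def check_autoresume_config(autoresume: bool, max_duration, num_fit_calls: int) -> None:
    """Fail fast on autoresume configurations that break checkpoint resumption."""
    if not autoresume:
        return
    if max_duration is None:
        raise ValueError(
            "When autoresume=True, pass max_duration to the Trainer constructor "
            "rather than duration to .fit(); otherwise each fit call shifts "
            "max_duration and the scheduler's t_max no longer matches on resume."
        )
    if num_fit_calls > 1:
        raise RuntimeError(
            "When autoresume=True, call .fit() only once per script (or track "
            "completed fit calls yourself in a callback's state_dict)."
        )
```

Calling this once before each `.fit()` turns the silent duration drift into an immediate, explanatory error.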
I've added enforcement for the above restrictions. We will look at enabling autoresume + multiple fits, but it's lower priority. |
Thanks for your responsiveness on this one 💪 |
Reopening, as we need to revert the guardrails. It seems there is a scenario in which this does work: the user tracks the number of fit calls in a callback's state_dict, skips over already-completed fit calls, and never checkpoints inside a fit call. I'll do a more careful follow-on after the release goes out. |
Ok, unfortunately, we cannot impose these restrictions in Composer because they break a few existing workflows. Here is a rough scenario: you want to train a model on a series of 5 datasets in a row, and you only checkpoint at … Given this, I think we will unfortunately have to leave it as is for now. The longer-term solution is for |
I see, thanks for the clarification. Would you mind sharing this callback? |
it would be roughly:
and then |
Thanks! I was trying to come up with a way for the callback to skip the training from within. |
What Mihir wrote seems like a reasonable way to count the number of fit calls; then you just need to do some of your own timekeeping, I believe. |
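The bookkeeping described above can be sketched without Composer at all. Below is a dependency-free mimic of the idea: an object that stores the number of completed fit calls in its `state_dict` (which autoresume would restore from the checkpoint) and tells the driving script whether a given fit call already ran. The class and method names are illustrative, not Composer's actual Callback API.

```python
class FitCallCounter:
    """Tracks how many .fit() calls have completed, surviving checkpoint reloads."""

    def __init__(self) -> None:
        self.completed_fits = 0

    def state_dict(self) -> dict:
        # Saved inside the checkpoint alongside model/optimizer state.
        return {"completed_fits": self.completed_fits}

    def load_state_dict(self, state: dict) -> None:
        # Restored on autoresume when the run restarts.
        self.completed_fits = state["completed_fits"]

    def should_skip(self, fit_index: int) -> bool:
        # Skip fit calls that finished before the interruption.
        return fit_index < self.completed_fits

    def mark_done(self) -> None:
        self.completed_fits += 1
```

In the driving script, each of the sequential fits would check `should_skip(i)` before calling `trainer.fit(...)` and call `mark_done()` afterward. Crucially, as noted in the thread, checkpoints must only be saved between fit calls, never inside one.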
Description
While experimenting with the `autoresume` feature, I encountered issues related to duration and the scheduler. My error was providing the `duration` argument to the `.fit()` method instead of the `max_duration` argument to the `Trainer` constructor. Since the `.fit()` method can be called multiple times sequentially, each call increases `max_duration`, as indicated in the code. Upon resumption, this offset causes an error in the scheduler because `t_max` becomes smaller than `max_duration`.

Using `max_duration` in the `Trainer` constructor avoids this problem, so I will adopt this approach. Should the scenario described above be detected, and should a warning or error be raised? Essentially, if `autoresume=True`, then `max_duration` should be specified in the `__init__`, and the `.fit()` method should only be called once in the script.