Resume GRPO run from arbitrary checkpoint step via trainer.checkpoint.load_step #425
base: main
Conversation
Will update all the other configs once I get preliminary approval on this PR.
At a high level this makes sense to me. Two main comments:
(1) Is there some way that we can test this in CI? It would be very helpful, given that checkpoint resume bugs are a huge pain to catch and reproduce.
(2) Imo our checkpoint config is starting to get a bit unintuitive. We should think about ways we can consolidate/simplify some of the fields.
Yes, I agree. But I am designing this based on Titan's implementation, and I agree it can be a bit confusing. An alternative design is that we can remove this and configure the trainer as:

```yaml
trainer:
  checkpoint:
    enable: true
    folder: ./checkpoint_v2
    initial_load_path: ./checkpoint/step-200
    initial_load_in_hf: false
    last_save_in_hf: true
    interval: 500
    async_mode: "disabled"
```

But the risk is: if we later also save replay_buffer and dataloader checkpoints, there could be a version misalignment problem, because training always starts from step 1 while the version number starts from the loaded checkpoint step.
Suggestions:
No. Titan's checkpoints are not in HF format.
Enable resuming GRPO runs from a specific step.
Our current implementation technically already supports resuming from a checkpoint, but it needs to be more obvious to users how to use it.
There are two possible designs we can use.
Design 1 (minimal changes)
We can force users to always use `initial_load_path`. The tricky part is that the checkpoint `folder` cannot already exist, otherwise Titan would ignore `initial_load_path`. So if users want to resume from a saved checkpoint, they have to start from step 0 and use a new folder to save the checkpoints. We probably want to add a comment in the config to make this clear; a sketch is shown below.
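For example, a hedged sketch of a Design 1 config with such a comment; the paths mirror the snippet earlier in this thread and the remaining values are illustrative rather than prescribed by this PR:

```yaml
trainer:
  checkpoint:
    enable: true
    # This folder must not already exist; if it does, Titan ignores
    # initial_load_path and the saved weights are never loaded.
    folder: ./checkpoint_v2
    # Weights to resume from; note that training still restarts at step 0.
    initial_load_path: ./checkpoint/step-200
    initial_load_in_hf: false
    interval: 500
```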
Risk:
If we later also save replay_buffer and dataloader checkpoints, this would cause a version misalignment problem, because training always starts from step 1.
Design 2 (this PR)
With the current design, to “resume,” users had to point `initial_load_path` at the weights and also create a new `folder`, because the Titan checkpointer ignores `initial_load_path` if the checkpoint `folder` already exists. This forced runs to restart at step 0, breaking version alignment. This PR introduces `load_step` to resume from an exact step without folder shenanigans or step resets.

With this PR, when `load_step > 0`, we:
- re-push the trainer weights for `load_step` in TorchStore, and
- set `policy_version == load_step`, unblocking `ReplayBuffer.sample(curr_policy_version=...)`.
.Key changes
- New config field: `trainer.checkpoint.load_step` (int).
- After `ts.initialize(...)`:
  - `trainer.push_weights(load_step)` → ensure weights exist at that version in TorchStore.
  - `policy.update_weights(load_step)` (optional: `ref_model.update_weights(load_step)`).
- `training_step` now starts at `max(load_step, 0)`.
Example (config sketches below):
- Resume from `./checkpoint/step-200`.
- Start from scratch and load from `initial_load_path`.
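Hedged config sketches for the two cases above. Field names follow the checkpoint snippet earlier in this thread; the `interval` value, the assumption that `load_step` defaults to `0`, and the reuse of the existing `./checkpoint` folder are illustrative rather than taken verbatim from this PR:

```yaml
# Resume from ./checkpoint/step-200 using the new load_step field:
trainer:
  checkpoint:
    enable: true
    folder: ./checkpoint      # existing folder containing step-200
    load_step: 200            # resume exactly from step 200
    interval: 500
```

```yaml
# Start from scratch, loading only the weights from initial_load_path:
trainer:
  checkpoint:
    enable: true
    folder: ./checkpoint_v2   # fresh folder so initial_load_path is honored
    initial_load_path: ./checkpoint/step-200
    load_step: 0              # assumed default: no step resume
    interval: 500
```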
Tests
- With `load_step=200`: verify first rollouts carry `generator_version == 200`.
- With `checkpoint.interval=10`: confirm a new checkpoint is saved at `./checkpoint/step-210`.
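For reference, a hedged sketch of the config such a verification run could use; only `load_step=200` and `interval=10` come from the steps above, the rest is illustrative:

```yaml
trainer:
  checkpoint:
    enable: true
    folder: ./checkpoint   # contains step-200 from the earlier run
    load_step: 200         # first rollouts should carry generator_version == 200
    interval: 10           # a new checkpoint should then appear at ./checkpoint/step-210
```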
.TODO: