Resume GRPO run from arbitrary checkpoint step via trainer.checkpoint.load_step #425
base: main
Conversation
Will update all the other configs once I get preliminary approval on this PR.
At a high level this makes sense to me. Two main comments:
(1) Is there some way that we can test this in CI? It would be very helpful, given that checkpoint resume bugs are a huge pain to catch and reproduce.
(2) Imo our checkpoint config is starting to get a bit unintuitive. We should think about ways we can consolidate/simplify some of the fields.
Yes, I agree. But I am designing this based on Titan's implementation, and I agree it can be a bit confusing. An alternative design is that we can remove this and configure the trainer as:

```yaml
trainer:
  checkpoint:
    enable: true
    folder: ./checkpoint_v2
    initial_load_path: ./checkpoint/step-200
    initial_load_in_hf: false
    last_save_in_hf: true
    interval: 500
    async_mode: "disabled"
```

But the risk is: if we later also save replay_buffer and dataloader checkpoints, there could be a version misalignment problem, because training always starts from step 1 while the version number starts from the loaded checkpoint step.
Suggestions:
No. Titan's checkpoints are not in HF format.
Enable resuming GRPO runs from a specific step.
Our current implementation technically already supports resuming from a checkpoint, but it needs to be more obvious to users how to use it.
There are two possible designs we can use.
Design 1 (minimal changes)
We can force users to always use `initial_load_path`. The tricky part is that the checkpoint `folder` cannot already exist, otherwise Titan would ignore `initial_load_path`. So if users want to resume from a saved checkpoint, they have to start from step 0 and use a new folder to save the checkpoints. We probably want to add a comment in the config to make this clear; a sketch is shown below.
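For example, a hedged sketch of a Design 1 config with such a comment; the paths mirror the snippet earlier in this thread and the remaining values are illustrative rather than prescribed by this PR:

```yaml
trainer:
  checkpoint:
    enable: true
    # This folder must not already exist; if it does, Titan ignores
    # initial_load_path and the saved weights are never loaded.
    folder: ./checkpoint_v2
    # Weights to resume from; note that training still restarts at step 0.
    initial_load_path: ./checkpoint/step-200
    initial_load_in_hf: false
    interval: 500
```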
Risk:
If we later also save replay_buffer and dataloader checkpoints, this would cause a version misalignment problem, because training always starts from step 1.
Design 2 (this PR)
With the current design, to “resume,” users had to point `initial_load_path` at the weights and also create a new `folder`, because the Titan checkpointer ignores `initial_load_path` if the checkpoint `folder` already exists. This forced runs to restart at step 0, breaking version alignment. This PR introduces `load_step` to resume from an exact step without folder shenanigans or step resets.

With this PR, when `load_step > 0`, we:
- re-push the trainer weights for `load_step` in TorchStore, and
- set `policy_version == load_step`, unblocking `ReplayBuffer.sample(curr_policy_version=...)`.
.Key changes
- New config field: `trainer.checkpoint.load_step` (int).
- After `ts.initialize(...)`:
  - `trainer.push_weights(load_step)` → ensure weights exist at that version in TorchStore.
  - `policy.update_weights(load_step)` (optional: `ref_model.update_weights(load_step)`).
- `training_step` now starts at `max(load_step, 0)`.
Example (config sketches below):
- Resume from `./checkpoint/step-200`.
- Start from scratch and load from `initial_load_path`.
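Hedged config sketches for the two cases above. Field names follow the checkpoint snippet earlier in this thread; the `interval` value, the assumption that `load_step` defaults to `0`, and the reuse of the existing `./checkpoint` folder are illustrative rather than taken verbatim from this PR:

```yaml
# Resume from ./checkpoint/step-200 using the new load_step field:
trainer:
  checkpoint:
    enable: true
    folder: ./checkpoint      # existing folder containing step-200
    load_step: 200            # resume exactly from step 200
    interval: 500
```

```yaml
# Start from scratch, loading only the weights from initial_load_path:
trainer:
  checkpoint:
    enable: true
    folder: ./checkpoint_v2   # fresh folder so initial_load_path is honored
    initial_load_path: ./checkpoint/step-200
    load_step: 0              # assumed default: no step resume
    interval: 500
```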
Tests
- With `load_step=200`: verify first rollouts carry `generator_version == 200`.
- With `checkpoint.interval=10`: confirm a new checkpoint is saved at `./checkpoint/step-210`.
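For reference, a hedged sketch of the config such a verification run could use; only `load_step=200` and `interval=10` come from the steps above, the rest is illustrative:

```yaml
trainer:
  checkpoint:
    enable: true
    folder: ./checkpoint   # contains step-200 from the earlier run
    load_step: 200         # first rollouts should carry generator_version == 200
    interval: 10           # a new checkpoint should then appear at ./checkpoint/step-210
```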
.TODO: