
When max_steps < save_steps with deepspeed zero3 stage #31624

Open

macheng6 opened this issue Jun 26, 2024 · 3 comments


macheng6 commented Jun 26, 2024

System Info

transformers 4.41.1

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

When I set `max_steps=100` and `save_steps=200`, the trainer could not save the trained weights and blocked somewhere, so the program never terminated.

The code is:

```python
if training_args.do_train:
    model.gradient_checkpointing_enable()
    model.enable_input_require_grads()
    trainer.train(args.resume_from_checkpoint)
    # For convenience, we also re-save the tokenizer to the same directory,
    # so that you can share your model easily on huggingface.co/models =)
    if trainer.is_world_process_zero():
        trainer.save_model()
        tokenizer.save_pretrained(training_args.output_dir)
```

Expected behavior

I hope that:

  • either an error message is displayed, warning the user that `max_steps < save_steps` (a sketch of such a check follows below),
  • or the last trained weights are saved anyway.
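A hedged sketch of the kind of check the first bullet asks for (hypothetical, not existing `transformers` code; assumes an integer `save_steps` rather than a ratio):

```python
# Hypothetical validation sketch: fail fast when no save step can ever be
# reached. `args` stands for a TrainingArguments instance.
if args.max_steps > 0 and args.save_steps and args.max_steps < args.save_steps:
    raise ValueError(
        f"max_steps ({args.max_steps}) is smaller than save_steps "
        f"({args.save_steps}); no checkpoint would ever be written during training."
    )
```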
macheng6 changed the title from "When max_steps < save_steps" to "When max_steps < save_steps with deepspeed zero3 stage" on Jun 26, 2024
amyeroberts (Collaborator) commented:

cc @muellerzr @SunMarc

macheng6 (Author) commented Jun 26, 2024

When I removed the `trainer.is_world_process_zero()` check, the code ran normally. With the check in place, the code blocks at `state_dict = self.accelerator.get_state_dict(self.deepspeed)` inside `trainer.save_model()`.
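This is consistent with a collective operation being entered on only one rank: under ZeRO-3 the model parameters are sharded across processes, so gathering the full state dict requires every rank to participate. A minimal sketch of the pattern that avoids the hang (assuming the standard `Trainer` API, which gates the actual file writes to the main process internally):

```python
trainer.train(args.resume_from_checkpoint)
# Call save_model() on every rank: gathering the ZeRO-3-sharded state dict
# is a collective op, and only the main process actually writes the files.
trainer.save_model()
if trainer.is_world_process_zero():
    # Plain file write with no collectives involved, safe to gate on rank 0.
    tokenizer.save_pretrained(training_args.output_dir)
```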

SunMarc (Member) commented Jun 26, 2024

Hi @macheng6, are you sure this is an issue with `max_steps` and `save_steps`? If you set `max_steps > save_steps`, do you still get the issue, and does the code below `trainer.is_world_process_zero()` run fine? If you have a minimal reproducer, that would be great!
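For reference, a hypothetical minimal reproducer along the lines asked for above (untested sketch; the tiny model id and the ZeRO-3 config path `ds_zero3.json` are placeholder assumptions). Launch with something like `deepspeed --num_gpus=2 repro.py`:

```python
# repro.py -- untested sketch of a minimal reproducer for the reported hang.
from torch.utils.data import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments)

class ToyDataset(Dataset):
    """A tiny fixed-length dataset so the run finishes in a few steps."""
    def __init__(self, tokenizer, n=256):
        self.ids = tokenizer("hello world", return_tensors="pt")["input_ids"][0]
        self.n = n
    def __len__(self):
        return self.n
    def __getitem__(self, i):
        return {"input_ids": self.ids, "labels": self.ids.clone()}

tokenizer = AutoTokenizer.from_pretrained("sshleifer/tiny-gpt2")
model = AutoModelForCausalLM.from_pretrained("sshleifer/tiny-gpt2")

args = TrainingArguments(
    output_dir="out",
    max_steps=100,              # max_steps < save_steps: no save step is ever hit
    save_steps=200,
    per_device_train_batch_size=1,
    deepspeed="ds_zero3.json",  # placeholder path to a ZeRO stage-3 config
)

trainer = Trainer(model=model, args=args, train_dataset=ToyDataset(tokenizer))
trainer.train()
if trainer.is_world_process_zero():  # this guard reproduces the reported hang
    trainer.save_model()
    tokenizer.save_pretrained(args.output_dir)
```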
