
When max_steps < save_steps with deepspeed zero3 stage #31624

Open

macheng6 opened this issue Jun 26, 2024 · 3 comments


macheng6 commented Jun 26, 2024

System Info

transformers 4.41.1

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

When I set `max_steps=100` and `save_steps=200`, the trainer could not save the trained weights and blocked somewhere, so the program never terminated.

The code is:

```python
if training_args.do_train:
    model.gradient_checkpointing_enable()
    model.enable_input_require_grads()
    trainer.train(args.resume_from_checkpoint)
    # For convenience, we also re-save the tokenizer to the same directory,
    # so that you can share your model easily on huggingface.co/models =)
    if trainer.is_world_process_zero():
        trainer.save_model()
        tokenizer.save_pretrained(training_args.output_dir)
```

Expected behavior

I hope that:

  • either an error message is displayed, warning the user that `max_steps < save_steps` (a sketch of such a check follows below),
  • or the last trained weights are saved anyway.
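A hedged sketch of the kind of check the first bullet asks for (hypothetical, not existing `transformers` code; assumes an integer `save_steps` rather than a ratio):

```python
# Hypothetical validation sketch: fail fast when no save step can ever be
# reached. `args` stands for a TrainingArguments instance.
if args.max_steps > 0 and args.save_steps and args.max_steps < args.save_steps:
    raise ValueError(
        f"max_steps ({args.max_steps}) is smaller than save_steps "
        f"({args.save_steps}); no checkpoint would ever be written during training."
    )
```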
macheng6 changed the title from "When max_steps < save_steps" to "When max_steps < save_steps with deepspeed zero3 stage" on Jun 26, 2024
amyeroberts (Collaborator) commented:

cc @muellerzr @SunMarc

macheng6 (Author) commented Jun 26, 2024

When I removed the `trainer.is_world_process_zero()` check, the code ran normally. With the check in place, the code blocks at `state_dict = self.accelerator.get_state_dict(self.deepspeed)` inside `trainer.save_model()`.
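This is consistent with a collective operation being entered on only one rank: under ZeRO-3 the model parameters are sharded across processes, so gathering the full state dict requires every rank to participate. A minimal sketch of the pattern that avoids the hang (assuming the standard `Trainer` API, which gates the actual file writes to the main process internally):

```python
trainer.train(args.resume_from_checkpoint)
# Call save_model() on every rank: gathering the ZeRO-3-sharded state dict
# is a collective op, and only the main process actually writes the files.
trainer.save_model()
if trainer.is_world_process_zero():
    # Plain file write with no collectives involved, safe to gate on rank 0.
    tokenizer.save_pretrained(training_args.output_dir)
```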

SunMarc (Member) commented Jun 26, 2024

Hi @macheng6, are you sure this is an issue with `max_steps` and `save_steps`? If you set `max_steps > save_steps`, do you still get the issue, and does the code below `trainer.is_world_process_zero()` run fine? If you have a minimal reproducer, that would be great!
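For reference, a hypothetical minimal reproducer along the lines asked for above (untested sketch; the tiny model id and the ZeRO-3 config path `ds_zero3.json` are placeholder assumptions). Launch with something like `deepspeed --num_gpus=2 repro.py`:

```python
# repro.py -- untested sketch of a minimal reproducer for the reported hang.
from torch.utils.data import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments)

class ToyDataset(Dataset):
    """A tiny fixed-length dataset so the run finishes in a few steps."""
    def __init__(self, tokenizer, n=256):
        self.ids = tokenizer("hello world", return_tensors="pt")["input_ids"][0]
        self.n = n
    def __len__(self):
        return self.n
    def __getitem__(self, i):
        return {"input_ids": self.ids, "labels": self.ids.clone()}

tokenizer = AutoTokenizer.from_pretrained("sshleifer/tiny-gpt2")
model = AutoModelForCausalLM.from_pretrained("sshleifer/tiny-gpt2")

args = TrainingArguments(
    output_dir="out",
    max_steps=100,              # max_steps < save_steps: no save step is ever hit
    save_steps=200,
    per_device_train_batch_size=1,
    deepspeed="ds_zero3.json",  # placeholder path to a ZeRO stage-3 config
)

trainer = Trainer(model=model, args=args, train_dataset=ToyDataset(tokenizer))
trainer.train()
if trainer.is_world_process_zero():  # this guard reproduces the reported hang
    trainer.save_model()
    tokenizer.save_pretrained(args.output_dir)
```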
