An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)
Reproduction
When I set max_steps=100 and save_steps=200, the trainer fails to save the trained weights and hangs, so the program never exits.
The code is:

    if training_args.do_train:
        model.gradient_checkpointing_enable()
        model.enable_input_require_grads()
        trainer.train(args.resume_from_checkpoint)
        # For convenience, we also re-save the tokenizer to the same directory,
        # so that you can share your model easily on huggingface.co/models =)
        if trainer.is_world_process_zero():
            trainer.save_model()
            tokenizer.save_pretrained(training_args.output_dir)
Expected behavior
I hope that:
either an error message is displayed, warning the user that max_steps < save_steps,
or the last trained weights are saved.
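The first option could be approximated on the user side with a small pre-flight check. This is only a sketch: `check_save_schedule` is a hypothetical helper, not part of transformers, and it assumes both values are plain integer step counts, with a non-positive max_steps meaning that training length is set by epochs instead.

```python
import warnings


def check_save_schedule(max_steps: int, save_steps: int) -> bool:
    """Return True if at least one periodic checkpoint can be written.

    Hypothetical helper (not a transformers API). Assumes integer step
    counts; max_steps <= 0 is treated as "no step limit".
    """
    if max_steps > 0 and save_steps > max_steps:
        warnings.warn(
            f"max_steps ({max_steps}) < save_steps ({save_steps}): "
            "no periodic checkpoint will be written during training."
        )
        return False
    return True
```

Calling it with the values from this report, `check_save_schedule(100, 200)`, emits the warning and returns False.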
macheng6 changed the title from "When max_steps < save_steps" to "When max_steps < save_steps with deepspeed zero3 stage" on Jun 26, 2024
When I remove the trainer.is_world_process_zero() check, the code runs normally. With the check in place, the code blocks inside trainer.save_model(), at the line state_dict = self.accelerator.get_state_dict(self.deepspeed).
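This behavior would be consistent with a collective operation being entered by only one rank: under ZeRO-3 the parameters are sharded, so gathering the full state dict needs every process to take part. A minimal sketch of the workaround described above, assuming a Trainer-like object and a tokenizer (`save_final_artifacts` is a hypothetical helper name, not a transformers function):

```python
def save_final_artifacts(trainer, tokenizer, output_dir):
    """Save the model from all ranks and the tokenizer from rank zero only.

    Hypothetical helper. `trainer` is assumed to expose save_model() and
    is_world_process_zero(); `tokenizer` is assumed to expose
    save_pretrained().
    """
    # Every rank calls save_model(), so a sharded state-dict gather can
    # complete; the trainer decides internally which process writes files.
    trainer.save_model(output_dir)

    # The tokenizer is not sharded, so only the main process writes it.
    if trainer.is_world_process_zero():
        tokenizer.save_pretrained(output_dir)
```

In the script above this would replace the guarded block after trainer.train(...), called as `save_final_artifacts(trainer, tokenizer, training_args.output_dir)`.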
Hi @macheng6, are you sure this is an issue with max_steps and save_steps? If you set max_steps > save_steps, do you still get the issue, and does the code below trainer.is_world_process_zero() run fine? If you have a minimal reproducer, that would be great!
System Info
transformers 4.41.1
Who can help?
No response