[Feature request] Add a setting to not use t5 encoder for train #94
Comments
Can you expand on this? I’m trying to train and wanted some insight on GPU requirements.
It uses 20 GB of VRAM for batch size 1 at 768x768 training for now.
Requesting to claim this task.
Bug report: I found a bug when I tried to train the model by following the README's instructions. Every rank printed the same traceback (they were interleaved in the original log):

Traceback (most recent call last):
  File "hydit/train_deepspeed.py", line 531, in <module>
    main(get_args())
  File "hydit/train_deepspeed.py", line 208, in main
    with open(f"{experiment_dir}/args.json", 'w') as f:
PermissionError: [Errno 13] Permission denied: '/args.json'

After that I checked the code and found the error in the function create_exp_folder:

def create_exp_folder(args, rank):
    if rank == 0:
        os.makedirs(args.results_dir, exist_ok=True)
    existed_experiments = list(Path(args.results_dir).glob("*dit*"))
    if len(existed_experiments) == 0:
        experiment_index = 1
    else:
        existed_experiments.sort()
        print('existed_experiments', existed_experiments)
        experiment_index = max([int(x.stem.split('-')[0]) for x in existed_experiments]) + 1
    dist.barrier()
    model_string_name = args.task_flag if args.task_flag else args.model.replace("/", "-")
    experiment_dir = f"{args.results_dir}/{experiment_index:03d}-{model_string_name}"  # Create an experiment folder
    checkpoint_dir = f"{experiment_dir}/checkpoints"  # Stores saved model checkpoints
    if rank == 0:
        os.makedirs(checkpoint_dir, exist_ok=True)
        logger = create_logger(experiment_dir)
        logger.info(f"Experiment directory created at {experiment_dir}")
    else:
        logger = create_logger()
        experiment_dir = ""  # here!
    return experiment_dir, checkpoint_dir, logger

In distributed data-parallel training, the subprocesses whose rank is not zero get this empty experiment_dir, so they try to open /args.json, which they do not have permission to write.
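One possible workaround until the function itself is patched: make writing args.json a rank-0-only operation. The save_args helper below is hypothetical (it is not part of the repo) and only sketches the guard that train_deepspeed.py could apply around its open() call.

```python
import json
import os


def save_args(experiment_dir, args_dict, rank):
    """Write args.json from rank 0 only.

    Non-zero ranks receive an empty experiment_dir from
    create_exp_folder, so they must skip the write entirely
    instead of resolving the path to '/args.json'.
    """
    if rank != 0:
        return None  # non-zero ranks do not write anything
    path = os.path.join(experiment_dir, "args.json")
    with open(path, "w") as f:
        json.dump(args_dict, f, indent=4)
    return path
```

Alternatively, rank 0's experiment_dir could be shared with the other ranks (for example via torch.distributed.broadcast_object_list) so that every process holds a valid path.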
Describe the feature
Currently, training HunyuanDiT requires a significant amount of VRAM. I noticed that the T5 encoder uses a lot of it, so I would like a setting to not use the T5 encoder during training.
Motivation
I noticed that during LoRA and fine-tuning, we were using a lot of VRAM relative to the size of the model.
Related resources
Additional context
For example, it would be appreciated if you could add an option such as --no-t5 that disables the T5 encoder at train time.
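One way such a flag could save memory (a sketch under assumptions, not HunyuanDiT's actual API): encode every caption once with the frozen T5 encoder, cache the embeddings, and release the encoder before the training loop starts, so T5 never occupies VRAM during optimization. cache_text_embeddings and encode_fn below are hypothetical names.

```python
import torch


def cache_text_embeddings(encoder, captions, encode_fn):
    """Encode every caption once, then drop this reference to the encoder.

    encode_fn(encoder, caption) -> Tensor is a stand-in for whatever
    call the training code uses to get T5 features for one caption.
    """
    with torch.no_grad():  # embeddings are fixed inputs, no grads needed
        cache = {caption: encode_fn(encoder, caption) for caption in captions}
    del encoder  # drop this function's reference; the caller should drop its own too
    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # return the freed memory to the CUDA driver
    return cache
```

The training loop would then look embeddings up in the cache instead of calling T5 on every step, which is roughly what a --no-t5 style option would need to do internally.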