[Feature request] Add a setting to not use t5 encoder for train #94

Bocchi-Chan2023 opened this issue Jun 16, 2024 · 4 comments


Bocchi-Chan2023 commented Jun 16, 2024

Describe the feature
Currently, training HunyuanDiT requires a significant amount of VRAM. I noticed that the T5 encoder uses a lot of it, so I would like a setting that avoids training the T5 encoder.

Motivation
I noticed that LoRA training and fine-tuning use a lot of VRAM relative to the size of the model.

Related resources

Additional context
For example, it would be appreciated if you could add an option such as --no-t5 that disables the T5 encoder at training time.
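For illustration only, here is a minimal sketch of what such a switch could look like; the flag name and the loader hooks (load_clip, load_t5) are hypothetical placeholders rather than the actual HunyuanDiT training code:

import argparse

def parse_args():
    parser = argparse.ArgumentParser(description="Illustrative training flags")
    # Hypothetical flag: skip the T5 text encoder entirely during training.
    parser.add_argument("--no-t5", action="store_true",
                        help="Do not load or run the T5 encoder; use the CLIP text embeddings only.")
    return parser.parse_args()

def build_text_encoders(args, load_clip, load_t5):
    # load_clip / load_t5 stand in for whatever loader functions the training
    # script actually uses. When --no-t5 is passed, the T5 weights are never
    # loaded, so their VRAM is never allocated.
    clip_encoder = load_clip()
    t5_encoder = None if args.no_t5 else load_t5()
    return clip_encoder, t5_encoder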

@Bocchi-Chan2023 Bocchi-Chan2023 changed the title [Feature request] Add a setting to not train t5 encoder [Feature request] Add a setting to not use t5 encoder for train Jun 17, 2024
@C0nsumption


Can you expand on this? I'm trying to train and wanted some insight on GPU requirements.

@Bocchi-Chan2023 (Author)

> Can you expand on this? I'm trying to train and wanted some insight on GPU requirements.

For now, it uses 20 GB of VRAM for batch-size-1 training at 768x resolution.

@Sunburst7

I'd like to claim this task.

@Sunburst7

Bug report

I found a bug when I tried to train the model by following the README's instructions; every non-zero-rank process crashes with the same traceback:

Traceback (most recent call last):
  File "hydit/train_deepspeed.py", line 531, in <module>
    main(get_args())
  File "hydit/train_deepspeed.py", line 208, in main
    with open(f"{experiment_dir}/args.json", 'w') as f:
PermissionError: [Errno 13] Permission denied: '/args.json'

After that, I checked the code and found the error in the function create_exp_folder:

def create_exp_folder(args, rank):
    if rank == 0:
        os.makedirs(args.results_dir, exist_ok=True)
    existed_experiments = list(Path(args.results_dir).glob("*dit*"))
    if len(existed_experiments) == 0:
        experiment_index = 1
    else:
        existed_experiments.sort()
        print('existed_experiments', existed_experiments)
        experiment_index = max([int(x.stem.split('-')[0]) for x in existed_experiments]) + 1
    dist.barrier()
    model_string_name = args.task_flag if args.task_flag else args.model.replace("/", "-")
    experiment_dir = f"{args.results_dir}/{experiment_index:03d}-{model_string_name}"       # Create an experiment folder
    checkpoint_dir = f"{experiment_dir}/checkpoints"                                        # Stores saved model checkpoints
    if rank == 0:
        os.makedirs(checkpoint_dir, exist_ok=True)
        logger = create_logger(experiment_dir)
        logger.info(f"Experiment directory created at {experiment_dir}")
    else:
        logger = create_logger()
        experiment_dir = "" # here!

    return experiment_dir, checkpoint_dir, logger

In the distributed data-parallel training setup, the subprocesses whose rank is not zero receive an empty experiment_dir, so they try to open /args.json, which they do not have permission to write.
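One possible fix, shown here as a minimal sketch that assumes the same imports as the original function (os, Path, dist, create_logger), is to let rank 0 decide the directory name and broadcast it to the other ranks with dist.broadcast_object_list, so every process writes args.json to the same path:

def create_exp_folder(args, rank):
    if rank == 0:
        os.makedirs(args.results_dir, exist_ok=True)
    dist.barrier()                                   # ensure results_dir exists before anyone uses it

    if rank == 0:
        existed_experiments = list(Path(args.results_dir).glob("*dit*"))
        if len(existed_experiments) == 0:
            experiment_index = 1
        else:
            experiment_index = max(int(x.stem.split('-')[0]) for x in existed_experiments) + 1
        model_string_name = args.task_flag if args.task_flag else args.model.replace("/", "-")
        experiment_dir = f"{args.results_dir}/{experiment_index:03d}-{model_string_name}"
    else:
        experiment_dir = None

    holder = [experiment_dir]
    dist.broadcast_object_list(holder, src=0)        # share the same path with every rank
    experiment_dir = holder[0]

    checkpoint_dir = f"{experiment_dir}/checkpoints"
    if rank == 0:
        os.makedirs(checkpoint_dir, exist_ok=True)
        logger = create_logger(experiment_dir)
        logger.info(f"Experiment directory created at {experiment_dir}")
    else:
        logger = create_logger()

    return experiment_dir, checkpoint_dir, logger

Alternatively, the open(f"{experiment_dir}/args.json", 'w') call in main could simply be guarded with a rank check so that only rank 0 writes args.json at all.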
