[Feature request] Add a setting to not use t5 encoder for train #94

Bocchi-Chan2023 opened this issue Jun 16, 2024 · 4 comments


Bocchi-Chan2023 commented Jun 16, 2024

Describe the feature
Currently, training HunyuanDiT requires a significant amount of VRAM. I noticed that the T5 encoder uses a lot of it, so I would like a setting that avoids training the T5 encoder.

Motivation
I noticed that LoRA training and fine-tuning use a lot of VRAM relative to the size of the model.

Related resources

Additional context
For example, it would be appreciated if you could add an option such as --no-t5 that disables the T5 encoder at training time.
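For illustration only, here is a minimal sketch of what such a switch could look like; the flag name and the loader hooks (load_clip, load_t5) are hypothetical placeholders rather than the actual HunyuanDiT training code:

import argparse

def parse_args():
    parser = argparse.ArgumentParser(description="Illustrative training flags")
    # Hypothetical flag: skip the T5 text encoder entirely during training.
    parser.add_argument("--no-t5", action="store_true",
                        help="Do not load or run the T5 encoder; use the CLIP text embeddings only.")
    return parser.parse_args()

def build_text_encoders(args, load_clip, load_t5):
    # load_clip / load_t5 stand in for whatever loader functions the training
    # script actually uses. When --no-t5 is passed, the T5 weights are never
    # loaded, so their VRAM is never allocated.
    clip_encoder = load_clip()
    t5_encoder = None if args.no_t5 else load_t5()
    return clip_encoder, t5_encoder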

@Bocchi-Chan2023 Bocchi-Chan2023 changed the title [Feature request] Add a setting to not train t5 encoder [Feature request] Add a setting to not use t5 encoder for train Jun 17, 2024
@C0nsumption


Can you expand on this? I'm trying to train and wanted some insight on GPU requirements.

@Bocchi-Chan2023 (Author)

> Can you expand on this? I'm trying to train and wanted some insight on GPU requirements.

For now, it uses 20 GB of VRAM for batch-size-1 training at 768x resolution.

@Sunburst7

I'd like to claim this task.

@Sunburst7

Bug report

I found a bug when I tried to train the model by following the README's instructions; every non-zero-rank process crashes with the same traceback:

Traceback (most recent call last):
  File "hydit/train_deepspeed.py", line 531, in <module>
    main(get_args())
  File "hydit/train_deepspeed.py", line 208, in main
    with open(f"{experiment_dir}/args.json", 'w') as f:
PermissionError: [Errno 13] Permission denied: '/args.json'

After that, I checked the code and found the error in the function create_exp_folder:

def create_exp_folder(args, rank):
    if rank == 0:
        os.makedirs(args.results_dir, exist_ok=True)
    existed_experiments = list(Path(args.results_dir).glob("*dit*"))
    if len(existed_experiments) == 0:
        experiment_index = 1
    else:
        existed_experiments.sort()
        print('existed_experiments', existed_experiments)
        experiment_index = max([int(x.stem.split('-')[0]) for x in existed_experiments]) + 1
    dist.barrier()
    model_string_name = args.task_flag if args.task_flag else args.model.replace("/", "-")
    experiment_dir = f"{args.results_dir}/{experiment_index:03d}-{model_string_name}"       # Create an experiment folder
    checkpoint_dir = f"{experiment_dir}/checkpoints"                                        # Stores saved model checkpoints
    if rank == 0:
        os.makedirs(checkpoint_dir, exist_ok=True)
        logger = create_logger(experiment_dir)
        logger.info(f"Experiment directory created at {experiment_dir}")
    else:
        logger = create_logger()
        experiment_dir = "" # here!

    return experiment_dir, checkpoint_dir, logger

In the distributed data-parallel training setup, the subprocesses whose rank is not zero receive an empty experiment_dir, so they try to open /args.json, which they do not have permission to write.
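One possible fix, shown here as a minimal sketch that assumes the same imports as the original function (os, Path, dist, create_logger), is to let rank 0 decide the directory name and broadcast it to the other ranks with dist.broadcast_object_list, so every process writes args.json to the same path:

def create_exp_folder(args, rank):
    if rank == 0:
        os.makedirs(args.results_dir, exist_ok=True)
    dist.barrier()                                   # ensure results_dir exists before anyone uses it

    if rank == 0:
        existed_experiments = list(Path(args.results_dir).glob("*dit*"))
        if len(existed_experiments) == 0:
            experiment_index = 1
        else:
            experiment_index = max(int(x.stem.split('-')[0]) for x in existed_experiments) + 1
        model_string_name = args.task_flag if args.task_flag else args.model.replace("/", "-")
        experiment_dir = f"{args.results_dir}/{experiment_index:03d}-{model_string_name}"
    else:
        experiment_dir = None

    holder = [experiment_dir]
    dist.broadcast_object_list(holder, src=0)        # share the same path with every rank
    experiment_dir = holder[0]

    checkpoint_dir = f"{experiment_dir}/checkpoints"
    if rank == 0:
        os.makedirs(checkpoint_dir, exist_ok=True)
        logger = create_logger(experiment_dir)
        logger.info(f"Experiment directory created at {experiment_dir}")
    else:
        logger = create_logger()

    return experiment_dir, checkpoint_dir, logger

Alternatively, the open(f"{experiment_dir}/args.json", 'w') call in main could simply be guarded with a rank check so that only rank 0 writes args.json at all.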
