
Inconsistent GPU memory usage of QLoRA (vs LoRA) with different numbers of GPUs #373

Open
albertoperdomo2 opened this issue Oct 15, 2024 · 5 comments

Comments

@albertoperdomo2

Describe the bug

While validating the fms-hf-tuning v2.0.1 image, we ran our workloads across different GPU counts to review the improvements associated with it. One of the comparisons was fine-tuning with LoRA on a full-precision model (in this case mistralai/Mistral-7B-v0.3) versus QLoRA on the quantized equivalent (in this case mistral-7b-v0.3-gptq), using the same settings. We found that with 8 GPUs the QLoRA run used more GPU memory than the equivalent LoRA run.

[Figure: mistral_merged_gpu_total_memory_usage_max — maximum total GPU memory usage, LoRA vs QLoRA, per GPU count]

Platform

RHOAI 2.12

Expected behavior

When running LoRA fine-tuning on a full-precision model and QLoRA fine-tuning on the same model quantized, GPU memory usage is expected to always be lower for QLoRA, since the quantized model's parameters are stored in a lower precision.
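
For illustration, a minimal sketch of the gap we expect between the two base checkpoints (plain transformers, not the fms-hf-tuning code path; it assumes auto-gptq/optimum is installed and uses the model paths from the configs posted below):

# Illustrative sketch only: load both base checkpoints and compare their
# parameter memory footprints. The GPTQ load requires auto-gptq (or optimum)
# to be installed; paths are the ones used in our runs.
import torch
from transformers import AutoModelForCausalLM

fp_model = AutoModelForCausalLM.from_pretrained(
    "/mnt/storage/model/mistral-7b-v0.3", torch_dtype=torch.float16
)
q_model = AutoModelForCausalLM.from_pretrained(
    "/mnt/storage/model/mistral-7b-v0.3-gptq", torch_dtype=torch.float16
)

# The quantized checkpoint should report a much smaller footprint, which is
# why we expect the QLoRA runs to stay below the LoRA runs at every GPU count.
print("full-precision base:", fp_model.get_memory_footprint() / 1e9, "GB")
print("GPTQ base:", q_model.get_memory_footprint() / 1e9, "GB")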

@fabianlim
Collaborator

I think the first thing that is confusing me is that if you keep everything else constant and only increase the number of GPUs, we do not expect memory consumption to increase. Can you post the arguments you are using for the experiments with different numbers of GPUs?

@albertoperdomo2
Author

albertoperdomo2 commented Oct 15, 2024

@fabianlim these are the configs that we used for all the tests:

QLoRA config

The only variable here was the number of GPUs.

{
  "training_data_path": "/mnt/output/dataset.json",
  "model_name_or_path": "/mnt/storage/model/mistral-7b-v0.3-gptq",
  "response_template": " ### Label:",
  "output_dir": "/mnt/output/fine-tuning",
  "save_model_dir": "/mnt/output/save-model-dir",
  "accelerate_launch_args": {
    "num_processes": 1,
    "num_machines": 1,
    "mixed_precision": "no",
    "dynamo_backend": "no",
    "downcast_bf16": "no",
    "main_training_function": "main",
    "rdzv_backend": "static",
    "same_network": true,
    "tpu_use_sudo": false
  },
  "num_train_epochs": 1,
  "per_device_train_batch_size": 4,
  "per_device_eval_batch_size": 4,
  "gradient_accumulation_steps": 4,
  "eval_strategy": "no",
  "save_strategy": "no",
  "learning_rate": 0.00001,
  "weight_decay": 0,
  "lr_scheduler_type": "cosine",
  "max_seq_length": 1024,
  "include_tokens_per_second": true,
  "dataset_text_field": "output",
  "use_flash_attn": true,
  "auto_gptq": [
    "triton_v2"
  ],
  "fp16": true,
  "gradient_checkpointing": true,
  "lora_alpha": 16,
  "max_steps": -1,
  "packing": false,
  "peft_method": "lora",
  "r": 4,
  "target_modules": [
    "all-linear"
  ],
  "torch_dtype": "float16",
  "warmup_ratio": 0.03
}

LoRA config

The only variable here was the number of GPUs.

{
  "training_data_path": "/mnt/output/dataset.json",
  "model_name_or_path": "/mnt/storage/model/mistral-7b-v0.3",
  "response_template": " ### Label:",
  "output_dir": "/mnt/output/fine-tuning",
  "save_model_dir": "/mnt/output/save-model-dir",
  "accelerate_launch_args": {
    "num_processes": 1,
    "num_machines": 1,
    "mixed_precision": "no",
    "dynamo_backend": "no",
    "downcast_bf16": "no",
    "main_training_function": "main",
    "rdzv_backend": "static",
    "same_network": true,
    "tpu_use_sudo": false
  },
  "num_train_epochs": 1,
  "per_device_train_batch_size": 4,
  "per_device_eval_batch_size": 4,
  "gradient_accumulation_steps": 4,
  "eval_strategy": "no",
  "save_strategy": "no",
  "learning_rate": 0.00001,
  "weight_decay": 0,
  "lr_scheduler_type": "cosine",
  "max_seq_length": 1024,
  "include_tokens_per_second": true,
  "dataset_text_field": "output",
  "use_flash_attn": true,
  "gradient_checkpointing": true,
  "lora_alpha": 16,
  "max_steps": -1,
  "packing": false,
  "peft_method": "lora",
  "r": 4,
  "target_modules": [
    "all-linear"
  ],
  "warmup_ratio": 0.03
}
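
For reference, per-rank peak memory could be logged with a small transformers TrainerCallback along these lines (a sketch only, not something fms-hf-tuning exposes):

# Sketch: a per-rank peak-memory logger that can be passed to any transformers
# Trainer (or trl SFTTrainer) via its `callbacks` argument.
import torch
from transformers import TrainerCallback

class PeakMemoryCallback(TrainerCallback):
    def on_step_end(self, args, state, control, **kwargs):
        if torch.cuda.is_available():
            peak_gb = torch.cuda.max_memory_allocated() / 1e9
            print(f"step {state.global_step}: peak allocated {peak_gb:.2f} GB on this rank")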

@fabianlim
Collaborator

@anhuong do you know why the FSDP sharding strategy is not specified in accelerate_launch_args? Does it fall back to a default?

@albertoperdomo2 if you look at our benchmarks, our settings are quite similar to yours, but you can see that when num_gpus goes up from 1 to 2, the memory consumption decreases:

https://github.com/foundation-model-stack/fms-acceleration/blob/main/scripts/benchmarks/refs/a100_80gb.csv#L102-L103
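
For reference, one way to pin the sharding strategy explicitly when driving accelerate from Python (a sketch, assuming an accelerate version where FullyShardedDataParallelPlugin still exposes sharding_strategy; this is not necessarily how fms-hf-tuning wires it internally):

# Sketch: explicitly selecting the FSDP sharding strategy via accelerate's
# plugin instead of relying on whatever default applies when it is omitted.
from accelerate import Accelerator, FullyShardedDataParallelPlugin
from torch.distributed.fsdp import ShardingStrategy

fsdp_plugin = FullyShardedDataParallelPlugin(
    sharding_strategy=ShardingStrategy.FULL_SHARD,  # or SHARD_GRAD_OP / NO_SHARD
)
accelerator = Accelerator(fsdp_plugin=fsdp_plugin)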

@albertoperdomo2
Author

@fabianlim we have seen this behavior mainly with this particular model pair. We are planning to test other equivalent models, but I wonder if this particular model itself might be the issue. Do you have results for mistralai/Mistral-7B-v0.3 / mistral-7b-v0.3-gptq?

@fabianlim
Collaborator

@albertoperdomo2 No, I'm sorry.
