
Inconsistent GPU memory usage of QLoRA (vs LoRA) with different numbers of GPUs #373

Open
albertoperdomo2 opened this issue Oct 15, 2024 · 5 comments

Comments

@albertoperdomo2

Describe the bug

While validating the fms-hf-tuning v2.0.1 image, we ran our workloads across different GPU counts to review the improvements associated with it. One of the comparisons was fine-tuning with LoRA on a full-precision model (in this case mistralai/Mistral-7B-v0.3) versus QLoRA on the quantized equivalent (in this case mistral-7b-v0.3-gptq), using the same settings. We found that with 8 GPUs the QLoRA run used more GPU memory than the equivalent LoRA run.

[Figure: mistral_merged_gpu_total_memory_usage_max — maximum total GPU memory usage, LoRA vs QLoRA, per GPU count]

Platform

RHOAI 2.12

Expected behavior

When running LoRA fine-tuning on a full-precision model and QLoRA fine-tuning on the same model quantized, GPU memory usage is expected to always be lower for QLoRA, since the quantized model's parameters are stored in a lower precision.
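
For illustration, a minimal sketch of the gap we expect between the two base checkpoints (plain transformers, not the fms-hf-tuning code path; it assumes auto-gptq/optimum is installed and uses the model paths from the configs posted below):

# Illustrative sketch only: load both base checkpoints and compare their
# parameter memory footprints. The GPTQ load requires auto-gptq (or optimum)
# to be installed; paths are the ones used in our runs.
import torch
from transformers import AutoModelForCausalLM

fp_model = AutoModelForCausalLM.from_pretrained(
    "/mnt/storage/model/mistral-7b-v0.3", torch_dtype=torch.float16
)
q_model = AutoModelForCausalLM.from_pretrained(
    "/mnt/storage/model/mistral-7b-v0.3-gptq", torch_dtype=torch.float16
)

# The quantized checkpoint should report a much smaller footprint, which is
# why we expect the QLoRA runs to stay below the LoRA runs at every GPU count.
print("full-precision base:", fp_model.get_memory_footprint() / 1e9, "GB")
print("GPTQ base:", q_model.get_memory_footprint() / 1e9, "GB")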

@fabianlim
Collaborator

I think the first thing that is confusing me is that if you keep everything else constant and only increase the number of GPUs, we do not expect memory consumption to increase. Can you post the arguments you are using for the experiments with different numbers of GPUs?

@albertoperdomo2
Author

albertoperdomo2 commented Oct 15, 2024

@fabianlim these are the configs that we used for all the tests:

QLoRA config

The only variable here was the number of GPUs.

{
  "training_data_path": "/mnt/output/dataset.json",
  "model_name_or_path": "/mnt/storage/model/mistral-7b-v0.3-gptq",
  "response_template": " ### Label:",
  "output_dir": "/mnt/output/fine-tuning",
  "save_model_dir": "/mnt/output/save-model-dir",
  "accelerate_launch_args": {
    "num_processes": 1,
    "num_machines": 1,
    "mixed_precision": "no",
    "dynamo_backend": "no",
    "downcast_bf16": "no",
    "main_training_function": "main",
    "rdzv_backend": "static",
    "same_network": true,
    "tpu_use_sudo": false
  },
  "num_train_epochs": 1,
  "per_device_train_batch_size": 4,
  "per_device_eval_batch_size": 4,
  "gradient_accumulation_steps": 4,
  "eval_strategy": "no",
  "save_strategy": "no",
  "learning_rate": 0.00001,
  "weight_decay": 0,
  "lr_scheduler_type": "cosine",
  "max_seq_length": 1024,
  "include_tokens_per_second": true,
  "dataset_text_field": "output",
  "use_flash_attn": true,
  "auto_gptq": [
    "triton_v2"
  ],
  "fp16": true,
  "gradient_checkpointing": true,
  "lora_alpha": 16,
  "max_steps": -1,
  "packing": false,
  "peft_method": "lora",
  "r": 4,
  "target_modules": [
    "all-linear"
  ],
  "torch_dtype": "float16",
  "warmup_ratio": 0.03
}

LoRA config

The only variable here was the number of GPUs.

{
  "training_data_path": "/mnt/output/dataset.json",
  "model_name_or_path": "/mnt/storage/model/mistral-7b-v0.3",
  "response_template": " ### Label:",
  "output_dir": "/mnt/output/fine-tuning",
  "save_model_dir": "/mnt/output/save-model-dir",
  "accelerate_launch_args": {
    "num_processes": 1,
    "num_machines": 1,
    "mixed_precision": "no",
    "dynamo_backend": "no",
    "downcast_bf16": "no",
    "main_training_function": "main",
    "rdzv_backend": "static",
    "same_network": true,
    "tpu_use_sudo": false
  },
  "num_train_epochs": 1,
  "per_device_train_batch_size": 4,
  "per_device_eval_batch_size": 4,
  "gradient_accumulation_steps": 4,
  "eval_strategy": "no",
  "save_strategy": "no",
  "learning_rate": 0.00001,
  "weight_decay": 0,
  "lr_scheduler_type": "cosine",
  "max_seq_length": 1024,
  "include_tokens_per_second": true,
  "dataset_text_field": "output",
  "use_flash_attn": true,
  "gradient_checkpointing": true,
  "lora_alpha": 16,
  "max_steps": -1,
  "packing": false,
  "peft_method": "lora",
  "r": 4,
  "target_modules": [
    "all-linear"
  ],
  "warmup_ratio": 0.03
}
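
For reference, per-rank peak memory could be logged with a small transformers TrainerCallback along these lines (a sketch only, not something fms-hf-tuning exposes):

# Sketch: a per-rank peak-memory logger that can be passed to any transformers
# Trainer (or trl SFTTrainer) via its `callbacks` argument.
import torch
from transformers import TrainerCallback

class PeakMemoryCallback(TrainerCallback):
    def on_step_end(self, args, state, control, **kwargs):
        if torch.cuda.is_available():
            peak_gb = torch.cuda.max_memory_allocated() / 1e9
            print(f"step {state.global_step}: peak allocated {peak_gb:.2f} GB on this rank")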

@fabianlim
Collaborator

@anhuong do you know why the FSDP sharding strategy is not specified in accelerate_launch_args? Does it fall back to a default?

@albertoperdomo2 if you look at our benchmarks, our settings are quite similar to yours, but you can see that when num_gpus goes up from 1 to 2, the memory consumption decreases:

https://github.com/foundation-model-stack/fms-acceleration/blob/main/scripts/benchmarks/refs/a100_80gb.csv#L102-L103
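
For reference, one way to pin the sharding strategy explicitly when driving accelerate from Python (a sketch, assuming an accelerate version where FullyShardedDataParallelPlugin still exposes sharding_strategy; this is not necessarily how fms-hf-tuning wires it internally):

# Sketch: explicitly selecting the FSDP sharding strategy via accelerate's
# plugin instead of relying on whatever default applies when it is omitted.
from accelerate import Accelerator, FullyShardedDataParallelPlugin
from torch.distributed.fsdp import ShardingStrategy

fsdp_plugin = FullyShardedDataParallelPlugin(
    sharding_strategy=ShardingStrategy.FULL_SHARD,  # or SHARD_GRAD_OP / NO_SHARD
)
accelerator = Accelerator(fsdp_plugin=fsdp_plugin)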

@albertoperdomo2
Author

@fabianlim we have seen this behavior mainly with this particular model pair. We are planning to test other equivalent models, but I wonder if this particular model itself might be the issue. Do you have results for mistralai/Mistral-7B-v0.3 / mistral-7b-v0.3-gptq?

@fabianlim
Collaborator

@albertoperdomo2 No, I'm sorry.
