clear cuda cache to help with memory leak/creep #1858

Merged: 2 commits merged into main on Aug 26, 2024
Conversation

winglian (Collaborator)

No description provided.

Comment on lines 1006 to 1007
torch.cuda.empty_cache()
gc.collect()
Contributor

Shouldn't gc.collect() be first?
In my experience, if there is a true leak (hanging references), empty_cache() cannot get rid of it on its own.

If it helps, I can try to debug where the leak is, given the training config.
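(For illustration only, not part of this PR: a minimal sketch of the ordering being suggested, with a hypothetical helper name. gc.collect() drops Python objects that are only kept alive by lingering references or cycles; only after that can torch.cuda.empty_cache() return the freed blocks to the driver.)

import gc
import torch

def free_unused_cuda_memory():
    # Drop Python-side references (including reference cycles) first,
    # so the tensors they keep alive are actually deallocated.
    gc.collect()
    # Then release the now-unused cached blocks held by PyTorch's
    # CUDA caching allocator back to the driver.
    if torch.cuda.is_available():
        torch.cuda.empty_cache()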

winglian (Collaborator, Author)

Thanks. I reordered the calls and added some context from a reproduction.

winglian (Collaborator, Author)

@chiragjn these were about 40 and 60 steps in:

[Screenshot 2024-08-25 at 4.00.11 PM]
[Screenshot 2024-08-25 at 4.02.06 PM]

base_model: NousResearch/Meta-Llama-3.1-8B

plugins:
  - axolotl.integrations.liger.LigerPlugin
  - axolotl.integrations.spectrum.SpectrumPlugin

spectrum_top_fraction: 0.5
# Optional if using a pre-scanned model as your base_model. Useful if using a model mirror
spectrum_model_name: meta-llama/Meta-Llama-3.1-8B
liger_rope: true
liger_rms_norm: true
liger_swiglu: true
liger_cross_entropy: true
# liger_fused_linear_cross_entropy: true

strict: false


chat_template: llama3

rl: dpo
datasets:
  - path: argilla/distilabel-intel-orca-dpo-pairs
    split: train
    type: llama3.icr

dataset_prepared_path: last_run_prepared
dataset_processes: 1
val_set_size: 0.02
output_dir: ./outputs/out

sequence_len: 2048
sample_packing: false
pad_to_sequence_len: false

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 1
micro_batch_size: 1
num_epochs: 1
optimizer: paged_adamw_8bit
lr_scheduler: cosine
learning_rate: 2e-5

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
early_stopping_patience:
resume_from_checkpoint:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 100
evals_per_epoch: 2
eval_table_size:
saves_per_epoch: 1
debug:
deepspeed: deepspeed_configs/zero2.json
weight_decay: 0.0
special_tokens:
  pad_token: <|finetune_right_pad_id|>
  eos_token: <|eot_id|>
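(Side note, not from the PR: a config like this is normally launched through axolotl's CLI, e.g. accelerate launch -m axolotl.cli.train dpo-repro.yml, where the config filename here is hypothetical.)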

winglian (Collaborator, Author)

[Screenshot 2024-08-25 at 4.12.02 PM]

And this is with the gc/clear-cache call on each step.
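(Illustrative only, not part of the PR: one way to confirm the creep is gone is to log allocator stats at each step and watch whether they level off.)

import torch

def log_cuda_memory(step):
    # memory_allocated: bytes currently held by live tensors.
    # memory_reserved: bytes held by the CUDA caching allocator (allocated + cached).
    alloc_mib = torch.cuda.memory_allocated() / 2**20
    reserved_mib = torch.cuda.memory_reserved() / 2**20
    print(f"step {step}: allocated={alloc_mib:.1f} MiB, reserved={reserved_mib:.1f} MiB")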

winglian merged commit 17af1d7 into main on Aug 26, 2024. 7 checks passed.
winglian deleted the dpo-mem-leak branch on August 26, 2024 at 19:50.