
OOM issue, GPU is an A100 40G #42

Open
gongye19 opened this issue May 8, 2024 · 5 comments

@gongye19 commented May 8, 2024

With llama factory I can SFT llama3-8B using deepspeed zero2, but with this framework I get OOM under deepspeed zero2 even with a batch size of 1.
Training with zero3 becomes very slow, and this warning appears:

2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
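
For reference, this is the kind of ZeRO-2 config I was planning to try next, with CPU optimizer offload added (a sketch only; the file name and the exact values are my guesses, not the config shipped with this repo):

```python
# Hypothetical ZeRO-2 config with CPU optimizer offload (illustrative only).
# The "auto" values assume the HuggingFace Trainer DeepSpeed integration
# fills them in from the training arguments.
import json

ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```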

@fe1ixxu (Owner) commented May 12, 2024

The OOM issue could be because llama3 has a 128K vocab size, while llama2's is 32K.
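
As a back-of-the-envelope check (illustrative numbers only, assuming the logits are upcast to fp32 for the loss), the logits tensor alone scales linearly with vocab size:

```python
# Rough estimate of fp32 logits memory for one forward pass:
# batch * seq_len * vocab_size * 4 bytes.
def logits_gib(batch, seq_len, vocab_size, bytes_per_elem=4):
    return batch * seq_len * vocab_size * bytes_per_elem / 1024**3

print(logits_gib(1, 4096, 32_000))   # llama-2 (32K vocab):  ~0.49 GiB
print(logits_gib(1, 4096, 128_256))  # llama-3 (128K vocab): ~1.96 GiB
```

On top of that, the input embedding and lm_head weights, their gradients, and their optimizer states also grow with the vocab size.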

@gongye19 (Author)

> The OOM issue could be because llama3 has a 128K vocab size, while llama2's is 32K.

I tried deepseek-7B and hit the same issue.

@fe1ixxu (Owner) commented May 14, 2024

The deepseek vocab size is also large -- 100K. The memory I used for training llama-2 was 64GB with 8/16 GPUs. Maybe you want to try using FSDP.
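
If you go the FSDP route, a rough sketch through the HF Trainer arguments could look like the following (assuming the training script accepts standard TrainingArguments; the exact fsdp_config key names vary a bit across transformers versions):

```python
# Rough sketch of switching from DeepSpeed to FSDP via the HF Trainer.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    gradient_checkpointing=True,   # large activation-memory saver on 40G cards
    bf16=True,
    fsdp="full_shard auto_wrap",
    fsdp_config={"transformer_layer_cls_to_wrap": "LlamaDecoderLayer"},
)
```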

@gongye19 (Author)

> The deepseek vocab size is also large -- 100K. The memory I used for training llama-2 was 64GB with 8/16 GPUs. Maybe you want to try using FSDP.

Thanks. Right now I first do SFT with llama factory, then run CPO with your framework. Is that workable?

@moore3930

Technically, 1 GPU (80G) should be enough to fine-tune LLaMA 7B with LoRA, but it always seems to OOM under your codebase unless we use 8 GPUs. I am just wondering why it uses so much memory.
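
For comparison, a minimal single-GPU LoRA setup along these lines normally fits a 7B model well under 80G (a sketch with a placeholder model name and LoRA hyperparameters, not this repo's defaults):

```python
# Minimal single-GPU LoRA setup; model name and hyperparameters are placeholders.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",     # placeholder 7B checkpoint
    torch_dtype=torch.bfloat16,
).to("cuda")
model.gradient_checkpointing_enable()

lora = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()   # only a small fraction should be trainable
```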
