
How can I train with only 48 GB of GPU memory? #5

Open
yfangZhang opened this issue Jul 31, 2024 · 0 comments

Comments

@yfangZhang

1. I tried setting batch_size to 32 and mini_batch_size to 1, but after 50 training steps an out-of-memory (OOM) error still occurs while saving the weights. (A possible workaround is sketched after this list.)
2. Changing

   ```python
   model = LlamaForCausalLM.from_pretrained(
       base_model,
       load_in_8bit=True,
       torch_dtype=torch.float32,
       device_map="auto",
   )
   ```

   to

   ```python
   model = LlamaForCausalLM.from_pretrained(
       base_model,
       load_in_8bit=True,
       torch_dtype=torch.float16,
       device_map="auto",
   )
   ```

   raises the following error (a one-line fix is sketched after this list):
File "/opt/anaconda3/envs/llm_envs/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in call_impl
return forward_call(*args, **kwargs)
File "/opt/anaconda3/envs/llm_envs/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 711, in forward
KG_infused = tmp.index_copy(0, index_fixed, attn_output)
RuntimeError: index_copy
(): self and source expected to have the same dtype, but got (self) Float and (source) Half
3. How can I run training on multiple GPUs? (A loading sketch follows below.)
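
On point 1, one common workaround (an assumption on my part, not confirmed by this repo) is that saving can allocate extra GPU memory while gathering tensors; copying the state dict to CPU first avoids that. A minimal sketch, where `model` is the trained model from the script and `checkpoint.pt` is a placeholder path:

```python
import torch

# Hypothetical workaround: materialize every tensor on CPU before
# serializing, so saving allocates host RAM instead of GPU memory.
state_dict = {k: v.detach().cpu() for k, v in model.state_dict().items()}
torch.save(state_dict, "checkpoint.pt")  # placeholder output path
```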
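
On point 2, the traceback says the destination of `index_copy` is still fp32 while `attn_output` becomes fp16 once the model is loaded with `torch_dtype=torch.float16`. A minimal, self-contained sketch of the mismatch and a one-line fix (the shapes below are made up; only the dtypes matter):

```python
import torch

tmp = torch.zeros(4, 8, dtype=torch.float32)  # destination buffer (Float)
attn_output = torch.randn(2, 8).half()        # attention output (Half)
index_fixed = torch.tensor([0, 2])

# This line reproduces the RuntimeError: self and source dtypes differ.
# KG_infused = tmp.index_copy(0, index_fixed, attn_output)

# Fix: cast the source to the destination's dtype before the copy.
KG_infused = tmp.index_copy(0, index_fixed, attn_output.to(tmp.dtype))
```

Applying the same `.to(tmp.dtype)` cast at line 711 of the repo's modified modeling_llama.py should clear the error; allocating `tmp` in fp16 from the start would work as well.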
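
On point 3, `device_map="auto"` already shards the layers across every visible GPU (naive model parallelism), so loading on a multi-GPU machine spreads the memory budget automatically; `max_memory` caps how much each card takes. A sketch, assuming two 24 GB cards (the limits and model path are placeholders):

```python
import torch
from transformers import LlamaForCausalLM

base_model = "path/to/llama-checkpoint"  # placeholder checkpoint path

# device_map="auto" splits layers across all visible GPUs;
# max_memory bounds the shard placed on each card.
model = LlamaForCausalLM.from_pretrained(
    base_model,
    load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map="auto",
    max_memory={0: "22GiB", 1: "22GiB"},  # placeholder per-GPU limits
)
```

For data-parallel training instead, the usual route is launching the script with torchrun, one process per GPU, but that requires the training loop to support DDP.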
