【GRPO】multinode training cuda error #3307

Open · glennccc opened this issue Feb 27, 2025 · 11 comments

glennccc commented Feb 27, 2025

2 nodes (16 A100 GPUs)

The training script is from https://github.com/modelscope/ms-swift/blob/main/examples/train/grpo/multi_node/multi_node1.sh

swift rlhf \
    --rlhf_type grpo \
    --model path/Qwen2.5-7B-Instruct \
    --reward_funcs accuracy format \
    --use_vllm true \
    --vllm_device auto \
    --vllm_gpu_memory_utilization 0.5 \
    --vllm_max_model_len 4096 \
    --train_type full \
    --torch_dtype bfloat16 \
    --dataset 'path/dataset/MATH-lighteval' \
    --max_completion_length 2048 \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 1e-6 \
    --gradient_accumulation_steps 2 \
    --eval_steps 200 \
    --save_steps 200 \
    --save_total_limit 2 \
    --logging_steps 5 \
    --max_length 4096 \
    --output_dir output/Qwen2.5-7B-Instruct_GRPO \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --dataset_num_proc 4 \
    --num_generations 8 \
    --temperature 0.9 \
    --system 'examples/train/grpo/prompt.txt' \
    --deepspeed zero2 \
    --log_completions true

==========================================================================================
However, I get a CUDA out-of-memory error. Why is your script fine on 2 nodes (4 GPUs each), but 16 GPUs hit this error?

[rank0]:     engine = AsyncLLMEngine.from_engine_args(self.engine_args)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 644, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 594, in __init__
[rank0]:     self.engine = self._engine_class(*args, **kwargs)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 267, in __init__
[rank0]:     super().__init__(*args, **kwargs)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 276, in __init__
[rank0]:     self._initialize_kv_caches()
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 429, in _initialize_kv_caches
[rank0]:     self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 121, in initialize_cache
[rank0]:     self.collective_rpc("initialize_cache",
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 51, in collective_rpc
[rank0]:     answer = run_method(self.driver_worker, method, args, kwargs)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/vllm/utils.py", line 2220, in run_method
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py", line 306, in initialize_cache
[rank0]:     self._init_cache_engine()
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py", line 311, in _init_cache_engine
[rank0]:     self.cache_engine = [
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py", line 312, in <listcomp>
[rank0]:     CacheEngine(self.cache_config, self.model_config,
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/vllm/worker/cache_engine.py", line 69, in __init__
[rank0]:     self.gpu_cache = self._allocate_kv_cache(
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/vllm/worker/cache_engine.py", line 103, in _allocate_kv_cache
[rank0]:     layer_kv_cache = torch.zeros(alloc_shape,
[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.42 GiB. GPU 7 has a total capacity of 79.35 GiB of which 950.19 MiB is free. Process 412980 has 29.59 GiB memory in use. Process 412968 has 48.83 GiB memory in use. Of the allocated memory 48.26 GiB is allocated by PyTorch, and 73.43 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

@glennccc (Author)

@Jintao-Huang @hjh0119 can you help me out?
If using 16 GPUs (2 nodes), how should I change the script?

hjh0119 (Collaborator) commented Feb 27, 2025

Maybe a similar issue? #3300

@glennccc (Author)

Maybe a similar issue? #3300

Not the same problem, I think.
Actually, I use this script instead of swift rlhf. Is something wrong with my script?

python -m torch.distributed.run --nnode=$WORLD_SIZE --nproc_per_node=$NUM_PROCESSES \
    --node_rank=$NODE_RANK --master_addr=$MASTER_ADDR --master_port=$MASTER_PORT \
    path/swift/cli/rlhf.py \
    --rlhf_type grpo \
    .....

hjh0119 (Collaborator) commented Feb 27, 2025

Try using swift rlhf instead of python cli/rlhf.py. The swift rlhf command actually runs cli/main.py, which in turn executes torchrun to enable Distributed Data Parallel (DDP).

This might solve your problem.
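
Roughly speaking (this is a simplified sketch, not the actual ms-swift code), with the environment variables from the example script set, swift rlhf ends up doing something like:

# Conceptual equivalent of `swift rlhf ...` when NNODES/NPROC_PER_NODE etc. are set.
# Simplified sketch only; the script path below is a placeholder.
torchrun \
    --nnodes="$NNODES" \
    --node_rank="$NODE_RANK" \
    --master_addr="$MASTER_ADDR" \
    --master_port="$MASTER_PORT" \
    --nproc_per_node="$NPROC_PER_NODE" \
    /path/to/swift/cli/rlhf.py \
    --rlhf_type grpo \
    "$@"   # remaining rlhf arguments are passed through unchanged

So the environment variables control the distributed setup instead of passing torchrun flags by hand.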

@glennccc (Author)

Try using swift rlhf instead of python cli/rlhf.py. The swift rlhf command actually runs cli/main.py, which in turn executes torchrun to enable Distributed Data Parallel (DDP).

This might solve your problem.

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export NNODES=2
export NODE_RANK=0
export MASTER_ADDR=$MASTER_ADDR
export MASTER_PORT=$MASTER_PORT
export NPROC_PER_NODE=7

@glennccc (Author)

Try using swift rlhf instead of python cli/rlhf.py. The swift rlhf command actually runs cli/main.py, which in turn executes torchrun to enable Distributed Data Parallel (DDP).
This might solve your problem.

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export NNODES=2
export NODE_RANK=0
export MASTER_ADDR=$MASTER_ADDR
export MASTER_PORT=$MASTER_PORT
export NPROC_PER_NODE=7

Is this right, or do I need to change something?

hjh0119 (Collaborator) commented Feb 27, 2025

Maybe on node 2, set NPROC_PER_NODE=8?

@glennccc (Author)

Maybe on node 2, set NPROC_PER_NODE=8?

In the example (https://github.com/modelscope/ms-swift/blob/main/examples/train/grpo/multi_node/multi_node1.sh) I see it is set to 3; my understanding is that a GPU has to be left for vLLM? If I set NPROC_PER_NODE=8, I get a vLLM OOM error.

hjh0119 (Collaborator) commented Feb 27, 2025

NPROC_PER_NODE=7 on node 1, NPROC_PER_NODE=8 on node 2?

@glennccc (Author)

NPROC_PER_NODE=7 on node 1, NPROC_PER_NODE=8 on node 2?

Currently my training platform does not support submitting different commands to the two nodes. As a fallback, would it also work to leave one GPU for vLLM on both node 1 and node 2? How should I change the configuration parameters?

@tastelikefeet (Collaborator)

NPROC_PER_NODE=7 on node 1, NPROC_PER_NODE=8 on node 2?

Currently my training platform does not support submitting different commands to the two nodes. As a fallback, would it also work to leave one GPU for vLLM on both node 1 and node 2? How should I change the configuration parameters?

That is supported; just set NPROC_PER_NODE=7 on both nodes.
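
For example, both nodes can run the same launch script (only NODE_RANK differs; MASTER_ADDR/MASTER_PORT below are placeholders for your cluster):

# Run on both nodes; set NODE_RANK=0 on node 1 and NODE_RANK=1 on node 2.
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export NNODES=2
export NODE_RANK=0                  # 1 on the second node
export MASTER_ADDR=xxx.xxx.xxx.xxx  # placeholder: IP of node 1
export MASTER_PORT=29500            # placeholder: any free port
export NPROC_PER_NODE=7             # 7 training processes; one GPU per node stays free for vLLM

# Launch as before; append the rest of the arguments from the original command.
swift rlhf \
    --rlhf_type grpo \
    --model path/Qwen2.5-7B-Instruct \
    --use_vllm true \
    --vllm_device auto \
    --vllm_gpu_memory_utilization 0.5 \
    --vllm_max_model_len 4096

With --vllm_device auto, the rollout engine should then pick up the GPU that is not occupied by a training process.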
