【GRPO】multinode training cuda error #3307

Open · glennccc opened this issue Feb 27, 2025 · 11 comments

glennccc commented Feb 27, 2025

2 nodes (16 A100 GPUs)

The training script is from https://github.com/modelscope/ms-swift/blob/main/examples/train/grpo/multi_node/multi_node1.sh

swift rlhf \
    --rlhf_type grpo \
    --model path/Qwen2.5-7B-Instruct \
    --reward_funcs accuracy format \
    --use_vllm true \
    --vllm_device auto \
    --vllm_gpu_memory_utilization 0.5 \
    --vllm_max_model_len 4096 \
    --train_type full \
    --torch_dtype bfloat16 \
    --dataset 'path/dataset/MATH-lighteval' \
    --max_completion_length 2048 \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 1e-6 \
    --gradient_accumulation_steps 2 \
    --eval_steps 200 \
    --save_steps 200 \
    --save_total_limit 2 \
    --logging_steps 5 \
    --max_length 4096 \
    --output_dir output/Qwen2.5-7B-Instruct_GRPO \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --dataset_num_proc 4 \
    --num_generations 8 \
    --temperature 0.9 \
    --system 'examples/train/grpo/prompt.txt' \
    --deepspeed zero2 \
    --log_completions true

==========================================================================================
However, I get a CUDA out-of-memory error. Why is your script fine on 2 nodes (4 GPUs each), but 16 GPUs hit this error?

[rank0]:     engine = AsyncLLMEngine.from_engine_args(self.engine_args)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 644, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 594, in __init__
[rank0]:     self.engine = self._engine_class(*args, **kwargs)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 267, in __init__
[rank0]:     super().__init__(*args, **kwargs)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 276, in __init__
[rank0]:     self._initialize_kv_caches()
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 429, in _initialize_kv_caches
[rank0]:     self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 121, in initialize_cache
[rank0]:     self.collective_rpc("initialize_cache",
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 51, in collective_rpc
[rank0]:     answer = run_method(self.driver_worker, method, args, kwargs)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/vllm/utils.py", line 2220, in run_method
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py", line 306, in initialize_cache
[rank0]:     self._init_cache_engine()
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py", line 311, in _init_cache_engine
[rank0]:     self.cache_engine = [
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py", line 312, in <listcomp>
[rank0]:     CacheEngine(self.cache_config, self.model_config,
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/vllm/worker/cache_engine.py", line 69, in __init__
[rank0]:     self.gpu_cache = self._allocate_kv_cache(
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/vllm/worker/cache_engine.py", line 103, in _allocate_kv_cache
[rank0]:     layer_kv_cache = torch.zeros(alloc_shape,
[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.42 GiB. GPU 7 has a total capacity of 79.35 GiB of which 950.19 MiB is free. Process 412980 has 29.59 GiB memory in use. Process 412968 has 48.83 GiB memory in use. Of the allocated memory 48.26 GiB is allocated by PyTorch, and 73.43 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

@glennccc (Author)

@Jintao-Huang @hjh0119 can you help me out?
If using 16 GPUs (2 nodes), how should I change the script?

hjh0119 (Collaborator) commented Feb 27, 2025

Maybe a similar issue? #3300

@glennccc (Author)

Maybe a similar issue? #3300

Not the same problem, I think.
Actually, I use this script instead of swift rlhf. Is something wrong with my script?

python -m torch.distributed.run --nnode=$WORLD_SIZE --nproc_per_node=$NUM_PROCESSES \
    --node_rank=$NODE_RANK --master_addr=$MASTER_ADDR --master_port=$MASTER_PORT \
    path/swift/cli/rlhf.py \
    --rlhf_type grpo \
    .....

hjh0119 (Collaborator) commented Feb 27, 2025

Try using swift rlhf instead of python cli/rlhf.py. The swift rlhf command actually runs cli/main.py, which in turn executes torchrun to enable Distributed Data Parallel (DDP).

This might solve your problem.
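
Roughly speaking (this is a simplified sketch, not the actual ms-swift code), with the environment variables from the example script set, swift rlhf ends up doing something like:

# Conceptual equivalent of `swift rlhf ...` when NNODES/NPROC_PER_NODE etc. are set.
# Simplified sketch only; the script path below is a placeholder.
torchrun \
    --nnodes="$NNODES" \
    --node_rank="$NODE_RANK" \
    --master_addr="$MASTER_ADDR" \
    --master_port="$MASTER_PORT" \
    --nproc_per_node="$NPROC_PER_NODE" \
    /path/to/swift/cli/rlhf.py \
    --rlhf_type grpo \
    "$@"   # remaining rlhf arguments are passed through unchanged

So the environment variables control the distributed setup instead of passing torchrun flags by hand.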

@glennccc (Author)

Try using swift rlhf instead of python cli/rlhf.py. The swift rlhf command actually runs cli/main.py, which in turn executes torchrun to enable Distributed Data Parallel (DDP).

This might solve your problem.

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export NNODES=2
export NODE_RANK=0
export MASTER_ADDR=$MASTER_ADDR
export MASTER_PORT=$MASTER_PORT
export NPROC_PER_NODE=7

@glennccc (Author)

Try using swift rlhf instead of python cli/rlhf.py. The swift rlhf command actually runs cli/main.py, which in turn executes torchrun to enable Distributed Data Parallel (DDP).
This might solve your problem.

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export NNODES=2
export NODE_RANK=0
export MASTER_ADDR=$MASTER_ADDR
export MASTER_PORT=$MASTER_PORT
export NPROC_PER_NODE=7

Is this right, or do I need to change something?

hjh0119 (Collaborator) commented Feb 27, 2025

Maybe on node 2, set NPROC_PER_NODE=8?

@glennccc (Author)

Maybe on node 2, set NPROC_PER_NODE=8?

In the example (https://github.com/modelscope/ms-swift/blob/main/examples/train/grpo/multi_node/multi_node1.sh) I see it is set to 3; my understanding is that a GPU has to be left for vLLM? If I set NPROC_PER_NODE=8, I get a vLLM OOM error.

hjh0119 (Collaborator) commented Feb 27, 2025

NPROC_PER_NODE=7 on node 1, NPROC_PER_NODE=8 on node 2?

@glennccc (Author)

NPROC_PER_NODE=7 on node 1, NPROC_PER_NODE=8 on node 2?

Currently my training platform does not support submitting different commands to the two nodes. As a fallback, would it also work to leave one GPU for vLLM on both node 1 and node 2? How should I change the configuration parameters?

@tastelikefeet (Collaborator)

NPROC_PER_NODE=7 on node 1, NPROC_PER_NODE=8 on node 2?

Currently my training platform does not support submitting different commands to the two nodes. As a fallback, would it also work to leave one GPU for vLLM on both node 1 and node 2? How should I change the configuration parameters?

That is supported; just set NPROC_PER_NODE=7 on both nodes.
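
For example, both nodes can run the same launch script (only NODE_RANK differs; MASTER_ADDR/MASTER_PORT below are placeholders for your cluster):

# Run on both nodes; set NODE_RANK=0 on node 1 and NODE_RANK=1 on node 2.
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export NNODES=2
export NODE_RANK=0                  # 1 on the second node
export MASTER_ADDR=xxx.xxx.xxx.xxx  # placeholder: IP of node 1
export MASTER_PORT=29500            # placeholder: any free port
export NPROC_PER_NODE=7             # 7 training processes; one GPU per node stays free for vLLM

# Launch as before; append the rest of the arguments from the original command.
swift rlhf \
    --rlhf_type grpo \
    --model path/Qwen2.5-7B-Instruct \
    --use_vllm true \
    --vllm_device auto \
    --vllm_gpu_memory_utilization 0.5 \
    --vllm_max_model_len 4096

With --vllm_device auto, the rollout engine should then pick up the GPU that is not occupied by a training process.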
