【GRPO】multinode training cuda error #3307
@Jintao-Huang @hjh0119 can you help me out?
Maybe a similar issue? #3300
Not the same problem, I think. I'm launching with: python -m torch.distributed.run --nnode=$WORLD_SIZE --nproc_per_node=$NUM_PROCESSES --node_rank=$NODE_RANK --master_addr=$MASTER_ADDR --master_port=$MASTER_PORT
Try using swift rlhf instead of python cli/rlhf.py. The swift rlhf command actually runs cli/main.py, which in turn executes torchrun to enable Distributed Data Parallel (DDP). This might solve your problem.
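A minimal sketch of that substitution, assuming the environment-variable convention from the linked multi_node example (NNODES, NODE_RANK, MASTER_ADDR, MASTER_PORT, NPROC_PER_NODE); swift rlhf reads these and invokes torchrun itself, so no explicit python -m torch.distributed.run is needed:

# instead of: python -m torch.distributed.run ... cli/rlhf.py ...
export NNODES=$WORLD_SIZE
export NODE_RANK=$NODE_RANK
export MASTER_ADDR=$MASTER_ADDR
export MASTER_PORT=$MASTER_PORT
export NPROC_PER_NODE=$NUM_PROCESSES
swift rlhf --rlhf_type grpo ...   # remaining flags exactly as in the script below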
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
Is this right, or do I need to change it?
Maybe on node2, set NPROC_PER_NODE=8?
In the example I see it set to 3 (https://github.com/modelscope/ms-swift/blob/main/examples/train/grpo/multi_node/multi_node1.sh); my understanding is that a GPU has to be left for vLLM? If I set NPROC_PER_NODE=8, I get a vLLM OOM error.
NPROC_PER_NODE=7 on node1 and NPROC_PER_NODE=8 on node2?
My training platform currently doesn't support submitting different commands to the two nodes. As a fallback, would it also work for both node1 and node2 to each reserve one GPU for vLLM? How should I modify the configuration parameters?
That is supported; just set NPROC_PER_NODE=7 on both nodes.
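A sketch of what that could look like when both nodes must run the same command, assuming all 8 GPUs stay visible and the GPU not occupied by the training ranks is the one vLLM takes via --vllm_device auto (the master address is a placeholder):

# identical on node1 and node2 except NODE_RANK
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export NNODES=2
export NODE_RANK=0                   # set to 1 on node2
export MASTER_ADDR=xxx.xxx.xxx.xxx   # placeholder: node1's address
export MASTER_PORT=29500
export NPROC_PER_NODE=7              # 7 training ranks per node; the 8th GPU is left for vLLM
swift rlhf --rlhf_type grpo --use_vllm true --vllm_device auto ...   # remaining flags as in the script below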
Setup: 2 nodes (16 A100 GPUs). The training script is taken from https://github.com/modelscope/ms-swift/blob/main/examples/train/grpo/multi_node/multi_node1.sh:
swift rlhf \
    --rlhf_type grpo \
    --model path/Qwen2.5-7B-Instruct \
    --reward_funcs accuracy format \
    --use_vllm true \
    --vllm_device auto \
    --vllm_gpu_memory_utilization 0.5 \
    --vllm_max_model_len 4096 \
    --train_type full \
    --torch_dtype bfloat16 \
    --dataset 'path/dataset/MATH-lighteval' \
    --max_completion_length 2048 \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 1e-6 \
    --gradient_accumulation_steps 2 \
    --eval_steps 200 \
    --save_steps 200 \
    --save_total_limit 2 \
    --logging_steps 5 \
    --max_length 4096 \
    --output_dir output/Qwen2.5-7B-Instruct_GRPO \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --dataset_num_proc 4 \
    --num_generations 8 \
    --temperature 0.9 \
    --system 'examples/train/grpo/prompt.txt' \
    --deepspeed zero2 \
    --log_completions true
==========================================================================================
However, I get a CUDA out-of-memory error. Why does your script work on 2 nodes (4 GPUs each), but with 16 GPUs I get this error?
[rank0]: engine = AsyncLLMEngine.from_engine_args(self.engine_args)
[rank0]: File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 644, in from_engine_args
[rank0]: engine = cls(
[rank0]: File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 594, in __init__
[rank0]: self.engine = self._engine_class(*args, **kwargs)
[rank0]: File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 267, in __init__
[rank0]: super().__init__(*args, **kwargs)
[rank0]: File "/opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 276, in __init__
[rank0]: self._initialize_kv_caches()
[rank0]: File "/opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 429, in _initialize_kv_caches
[rank0]: self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
[rank0]: File "/opt/conda/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 121, in initialize_cache
[rank0]: self.collective_rpc("initialize_cache",
[rank0]: File "/opt/conda/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 51, in collective_rpc
[rank0]: answer = run_method(self.driver_worker, method, args, kwargs)
[rank0]: File "/opt/conda/lib/python3.10/site-packages/vllm/utils.py", line 2220, in run_method
[rank0]: return func(*args, **kwargs)
[rank0]: File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py", line 306, in initialize_cache
[rank0]: self._init_cache_engine()
[rank0]: File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py", line 311, in _init_cache_engine
[rank0]: self.cache_engine = [
[rank0]: File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py", line 312, in <listcomp>
[rank0]: CacheEngine(self.cache_config, self.model_config,
[rank0]: File "/opt/conda/lib/python3.10/site-packages/vllm/worker/cache_engine.py", line 69, in __init__
[rank0]: self.gpu_cache = self._allocate_kv_cache(
[rank0]: File "/opt/conda/lib/python3.10/site-packages/vllm/worker/cache_engine.py", line 103, in _allocate_kv_cache
[rank0]: layer_kv_cache = torch.zeros(alloc_shape,
[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.42 GiB. GPU 7 has a total capacity of 79.35 GiB of which 950.19 MiB is free. Process 412980 has 29.59 GiB memory in use. Process 412968 has 48.83 GiB memory in use. Of the allocated memory 48.26 GiB is allocated by PyTorch, and 73.43 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
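Reading the numbers in the message: GPU 7 has 79.35 GiB in total, and two processes already sit on it (29.59 GiB + 48.83 GiB ≈ 78.4 GiB), so less than 1 GiB is free when vLLM tries to allocate a 1.42 GiB KV-cache block. With NPROC_PER_NODE=8 every GPU on the node carries a training rank, so the vLLM engine placed by --vllm_device auto likely ends up sharing GPU 7 with a training process, while --vllm_gpu_memory_utilization 0.5 assumes roughly half of the 80 GiB card is available to it. This is consistent with the advice above to use NPROC_PER_NODE=7 so that one GPU per node is reserved for vLLM alone.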