GRPO uses vLLM to load the reference model for data sampling. The limitation is that tensor parallelism is not supported.
What if the reference model is larger than one GPU can hold, for example a 72B model on 40GB H800s?
Is there a setting to pass `tensor_parallel_size` to the vLLM params? The relevant code in the trainer is:
```python
if self.accelerator.is_main_process:
    vllm_device = self.args.vllm_device
    if vllm_device == "auto":
        vllm_device = f"cuda:{self.accelerator.num_processes}"  # take the next GPU idx
    # Check that the requested device is available
    if vllm_device.split(":")[0] == "cuda" and int(vllm_device.split(":")[1]) >= torch.cuda.device_count():
        raise ValueError(
            f"The requested device for vllm ({vllm_device}) is not available. You are likely using vLLM "
            "without restricting the number of GPUs for training. Set the `--num_processes` argument to a "
            "value lower than the number of GPUs available on your machine—typically, reducing it by one "
            f"is sufficient. In your case: `--num_processes {torch.cuda.device_count() - 1}`."
        )
    # Check that the requested device is not also used for training
    if vllm_device in {f"cuda:{idx}" for idx in range(self.accelerator.num_processes)}:
        warnings.warn(
            f"The requested device {vllm_device} is also used for training. This may lead to unexpected "
            "behavior. It is recommended to use a dedicated device for vLLM."
        )
    # vLLM is not compatible with accelerate. So we need to patch it to make sure we can (1) place the vLLM
    # model on the desired device (world_size_patch) and (2) avoid a test that is not designed for our
    # setting (profiling_patch).
    world_size_patch = patch("torch.distributed.get_world_size", return_value=1)
    profiling_patch = patch(
        "vllm.worker.worker.Worker._assert_memory_footprint_increased_during_profiling", return_value=None
    )
    with world_size_patch, profiling_patch:
        self.llm = LLM(
            model=model.name_or_path,
            device=vllm_device,
            gpu_memory_utilization=self.args.vllm_gpu_memory_utilization,
            dtype=self.args.vllm_dtype,
            # Automatic Prefix Caching caches the KV cache of existing queries, so that a new query can
            # directly reuse the KV cache if it shares the same prefix with one of the existing queries.
            # This is particularly useful here because we generate completions from the same prompts.
            enable_prefix_caching=True,
            max_model_len=self.args.vllm_max_model_len,
        )

    self.sampling_params = SamplingParams(
        temperature=args.temperature,
        max_tokens=self.max_completion_length,
    )
```
The current implementation limits the vLLM instance to a single GPU. We are exploring other solutions, and I will try to update here when we have something more scalable.
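In the meantime, the single-GPU setup the current code expects is: lower `--num_processes` by one so the last GPU stays free for vLLM (as the error message in the excerpt suggests), and let the trainer pick it up. A hedged sketch for an 8-GPU node; the config field names mirror the excerpt above, and `use_vllm` is assumed to be the switch that enables this code path:

```python
from trl import GRPOConfig

# Launch with: accelerate launch --num_processes 7 train_grpo.py
# so that cuda:7 is left free for the vLLM engine.
training_args = GRPOConfig(
    output_dir="grpo-output",
    use_vllm=True,                    # assumed flag that enables the vLLM path
    vllm_device="auto",               # resolves to cuda:{num_processes}, i.e. cuda:7
    vllm_gpu_memory_utilization=0.9,
    vllm_dtype="bfloat16",
)
```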