[GRPO] Allow the use of the vllm logprobs, rather than recomputing them #3193
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Overall looks good, with a potential typo in the padding of the log probs. It would be quite interesting to know if there is any noticeable difference between using the vLLM logprobs vs the native ones when beta > 0 (not for this PR, but more generally on some simple baseline).
@@ -81,6 +81,8 @@ class GRPOConfig(TrainingArguments):
    use_vllm (`bool`, *optional*, defaults to `False`):
        Whether to use vLLM for generating completions. If set to `True`, ensure that a GPU is kept unused for
        training, as vLLM will require one for generation. vLLM must be installed (`pip install vllm`).
    use_vllm_logprobs (`bool`, *optional*, defaults to `False`):
        Whether to use vLLM's logprobs for the `"old_logprobs"` in the GRPO loss. Requires `use_vllm=True`.
nit:
- Whether to use vLLM's logprobs for the `"old_logprobs"` in the GRPO loss. Requires `use_vllm=True`.
+ Whether to use vLLM's logprobs to compute the GRPO loss instead of using the native `forward()` method. This is more compute efficient because vLLM computes the policy's logprobs in parallel to completions. Requires `use_vllm=True`.
Thanks. By the way, I think there may be some confusion: this PR is not related to the ref model at all. Previously, the `old_logprobs` were (re)calculated from the model we are optimizing, in `torch.no_grad` mode, using the tokens generated by the vLLM instance. They are used in GRPO's clipping loss to constrain the model's updates to stay within the clipping range (`epsilon`) of the old policy. Now we use the logprobs from the vLLM instance instead. I believe you are thinking of the KL penalty, which still uses the logprobs from the ref model and the model we are optimizing, not the vLLM logprobs.
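For readers following along, here is a minimal sketch of where those old per-token logprobs enter the clipped part of the loss. The function name, tensor shapes, and the omission of masking/normalization are illustrative assumptions, not TRL's actual implementation:

```python
import torch

def clipped_grpo_term(per_token_logps, old_per_token_logps, advantages, epsilon=0.2):
    # Ratio between the current policy and the "old" policy that generated the samples.
    ratio = torch.exp(per_token_logps - old_per_token_logps)              # (B, T)
    unclipped = ratio * advantages.unsqueeze(-1)                          # (B, T)
    clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantages.unsqueeze(-1)
    # Per-token surrogate loss; masking of padded tokens and averaging are omitted here.
    return -torch.min(unclipped, clipped)
```

This is why the choice of `old_logprobs` (recomputed vs taken from vLLM) only affects the clipping ratio, not the KL term.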
Ah yeah, you're right - I've amended my suggestion to align with that. (I think it's not a good idea to reference internal variables in docstrings, as the user cannot see them unless they drill into the code.)
use_vllm_logprobs: bool = field(
    default=False,
    metadata={
        "help": "Whether to use vLLM's logprobs for the 'old_logprobs' in the GRPO loss. Requires use_vllm=True."
"help": "Whether to use vLLM's logprobs for the 'old_logprobs' in the GRPO loss. Requires use_vllm=True." | |
Whether to use vLLM's logprobs to compute the GRPO loss instead of using the native `forward()` method. This is more compute efficient because vLLM computes the policy's logprobs in parallel to completions. Requires use_vllm=True. |
# Broadcast the completions from the main process to all processes, ensuring each process receives its
# corresponding slice.
completion_ids = broadcast_object_list(completion_ids, from_process=0)
vllm_log_probs = broadcast_object_list(vllm_log_probs, from_process=0)
Do we get a performance hit from broadcasting these arrays when `use_vllm_logprobs=False`, or is it marginal?
Most likely marginal, as the GPUs are all already in sync due to the gather of `completion_ids`.
👌
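For context, a minimal standalone sketch of the broadcast pattern used above, based on `accelerate.utils.broadcast_object_list`; the toy data and the process setup are assumptions for illustration:

```python
from accelerate import Accelerator
from accelerate.utils import broadcast_object_list

accelerator = Accelerator()

# Only the main process queried the vLLM server; other processes hold placeholders
# of the same length (toy data, for illustration).
if accelerator.is_main_process:
    completion_ids = [[101, 102, 103], [201, 202, 203]]
    vllm_log_probs = [[-0.1, -0.2, -0.3], [-0.4, -0.5, -0.6]]
else:
    completion_ids = [None] * 2
    vllm_log_probs = [None] * 2

# After broadcasting, every process holds the main process's objects and can
# slice out the portion corresponding to its own prompts.
completion_ids = broadcast_object_list(completion_ids, from_process=0)
vllm_log_probs = broadcast_object_list(vllm_log_probs, from_process=0)
```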
# Pad the completions, and concatenate them with the prompts
completion_ids = [torch.tensor(ids, device=device) for ids in completion_ids]
completion_ids = pad(completion_ids, padding_value=self.processing_class.pad_token_id)
vllm_log_probs = [torch.tensor(logp, device=device) for logp in vllm_log_probs]
vllm_log_probs = pad(vllm_log_probs, padding_value=self.processing_class.pad_token_id)
- vllm_log_probs = pad(vllm_log_probs, padding_value=self.processing_class.pad_token_id)
+ vllm_log_probs = pad(vllm_log_probs, padding_value=0.0)
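To see why the padding value matters here: `pad_token_id` is a vocabulary index (often a large integer), so padding log-probability tensors with it would inject bogus values, whereas `0.0` keeps the padded positions neutral (they are presumably masked out downstream). A small sketch using `torch.nn.utils.rnn.pad_sequence` in place of TRL's `pad` helper:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Variable-length per-token logprobs for two completions (toy values).
vllm_log_probs = [torch.tensor([-0.1, -0.2, -0.3]), torch.tensor([-0.4, -0.5])]

# Padding with 0.0 rather than a pad_token_id keeps the extra positions neutral.
padded = pad_sequence(vllm_log_probs, batch_first=True, padding_value=0.0)
print(padded)
# tensor([[-0.1000, -0.2000, -0.3000],
#         [-0.4000, -0.5000,  0.0000]])
```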
Thanks for the clarification about how the logprobs are used in the loss. LGTM with the minor tweak to the docstring.
{"completion_ids": [[101, 102, 103], [201, 202, 203]]} | ||
{"log_probs": [[1.1, 1.2, 1.3], [2.1, 2.2, 2.3]]} |
{"completion_ids": [[101, 102, 103], [201, 202, 203]]} | |
{"log_probs": [[1.1, 1.2, 1.3], [2.1, 2.2, 2.3]]} | |
{"completion_ids": [[101, 102, 103], [201, 202, 203]], | |
"log_probs": [[1.1, 1.2, 1.3], [2.1, 2.2, 2.3]]} |
This PR exposes the logprobs of the token IDs that were generated by the vLLM client.
The default will be `False`, as these logprobs are likely different from those produced by the policy.
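A hypothetical end-to-end sketch of how the new flag would be used; the model name, dataset, and reward function are placeholders, and the exact trainer arguments may differ from the final API:

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

def reward_len(completions, **kwargs):
    # Toy reward: prefer completions close to 20 characters.
    return [-abs(20 - len(c)) for c in completions]

dataset = load_dataset("trl-lib/tldr", split="train")

config = GRPOConfig(
    output_dir="grpo-vllm-logprobs",
    use_vllm=True,             # generate completions with vLLM
    use_vllm_logprobs=True,    # reuse vLLM's logprobs instead of recomputing them (this PR)
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_len,
    args=config,
    train_dataset=dataset,
)
trainer.train()
```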