[Core][Distributed] merge two broadcast_tensor_dict #5354
Conversation
@zhuohan123 PTAL, if you agree with this design, I can refactor the rest of the model runners as well.
The performance gain seems to be non-trivial. (Benchmark results with/without this PR and the script were attached as images.)
machine: 8*H100
Hey, still a general comment: can we make the code more specific? Why do we design the function arguments with a general name like aux?
@@ -47,12 +46,14 @@ def __init__(
    @torch.inference_mode()
    def execute_model(
        self,
        seq_group_metadata_list: Optional[List[SequenceGroupMetadata]],
        broadcast_inputs: Dict[str, Any],
        aux: Optional[List[Any]],
Why do we add an aux argument? I feel like this makes the arguments more confusing.
How about the current design:

def execute_model(
    self,
    modelrunner_input: ModelRunnerInput,
    kv_caches: List[torch.Tensor],
) -> Optional[SamplerOutput]:

I feel this is quite general: the driver worker prepares the input and separates it into objects to broadcast and objects to keep for itself (i.e. the …
But the method is called …
How about this design, for non-driver workers: …
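The split described above (driver prepares the full input, broadcasts one part, and keeps driver-only objects for itself) can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the in-process `_channel` stands in for the real `torch.distributed` broadcast, and the field names (`input_tokens`, `sampling_metadata`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional

# Hypothetical stand-in for the distributed broadcast: the driver (src)
# publishes a dict and every rank receives a copy of it.
_channel: Dict[str, Any] = {}

def broadcast_tensor_dict(d: Optional[Dict[str, Any]] = None,
                          src: int = 0) -> Dict[str, Any]:
    global _channel
    if d is not None:          # driver publishes its dict
        _channel = dict(d)
    return dict(_channel)      # every rank receives the same dict

@dataclass
class ModelRunnerInput:
    input_tokens: List[int]
    sampling_metadata: Optional[Any]  # driver-only "aux" object, never broadcast

def driver_execute(seq_data: List[int]) -> ModelRunnerInput:
    # Driver prepares inputs, splitting them into the part to broadcast
    # and the part it keeps for itself.
    to_broadcast = {"input_tokens": seq_data}
    aux = {"sampling": "driver-local"}  # e.g. SamplingMetadata
    broadcast_tensor_dict(to_broadcast, src=0)
    return ModelRunnerInput(seq_data, aux)

def non_driver_execute() -> ModelRunnerInput:
    # Non-driver workers rebuild their input from the single broadcast.
    received = broadcast_tensor_dict(src=0)
    return ModelRunnerInput(received["input_tokens"], None)
```

The point of the design is that only one broadcast happens per step, and the driver-only objects never cross the wire.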
Thanks @youkaichao, I am very much in favor of this (I had been planning to do it myself at some point)!
Some other thoughts:
- This halves the number of broadcasts done; we can halve it again by adding [Core] Avoid one broadcast op when propagating metadata #4844 or equivalent, which I expect would give an additional non-negligible latency benefit
- Spec decoding adds another per-step broadcast, which I hope can be similarly coalesced with this one
        return metadata_dict, [sampling_metadata]

    def convert_broadcast_inputs_to_modelrunner_input(
            self, metadata_dict: Dict[str, Any], aux: Optional[List[Any]]):
Add return typing?
I deleted it to avoid a typing conflict. This function, convert_broadcast_inputs_to_modelrunner_input, is implemented in embedding_model_runner.py as well, but the return types differ (although both are named ModelRunnerInput, they live in different modules, so mypy complains about it).
Suggestions are welcome for fixing this while keeping mypy happy.
        if attn_metadata:
            metadata_dict.update(attn_metadata.asdict_zerocopy())
        broadcast_tensor_dict(metadata_dict, src=0)

    def prepare_inputs_to_broadcast(
I wonder whether we should have this return ModelRunnerInput and then add a method to it, get_dict_to_broadcast(), which you only call in the TP > 1 case. This would be more efficient for the non-TP case.
Then this method could be called prepare_modelrunner_input(), which would make more sense to me since it's used even for non-TP, where broadcasting isn't being done.
That would also obviate the need for this separate aux arg/variable.
The problem is who takes the responsibility for driving the broadcast process, and how it knows which function to call, given that we have inheritance in model_runner.py and embedding_model_runner.py.
Previously, each model runner drove the broadcast operation itself, inside execute_model, which led to a separate broadcast.
In this PR, the worker drives the broadcast operation, so it needs ModelRunner to return inputs_to_broadcast, broadcast it, and then feed it back to ModelRunner.
If we use ModelRunnerInput.get_dict_to_broadcast, how can the worker know which ModelRunnerInput to use? We have multiple ModelRunnerInputs, used in both model_runner.py and embedding_model_runner.py.
> If we use ModelRunnerInput.get_dict_to_broadcast, how can the worker know which ModelRunnerInput to use? We have multiple ModelRunnerInputs, used in both model_runner.py and embedding_model_runner.py.

You kind of answered your own question :) ... both ModelRunnerInputs implement that, and the worker just calls it (it could be a protocol)
This can be solved by adding an abstract class and letting ModelRunnerInput inherit from it. For reasons I don't know, inheritance is discouraged in vLLM, and several model runners just duplicate the code.
Inheritance is used in various places already. IMO it makes sense to use it judiciously, without overdoing it.
In any case, that's why I suggested using a protocol here; you can do the same thing without inheritance.
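For illustration, the protocol idea could look roughly like this. A minimal sketch, not the PR's code: the protocol name and the two input classes are hypothetical, standing in for the inputs defined in model_runner.py and embedding_model_runner.py.

```python
from typing import Any, Dict, Protocol

class BroadcastableInput(Protocol):
    """Hypothetical protocol: any input object that can report which of
    its fields must be broadcast to non-driver workers under TP > 1."""
    def get_dict_to_broadcast(self) -> Dict[str, Any]: ...

# Two unrelated classes (say, one per model-runner module) satisfy the
# protocol structurally; no shared base class or inheritance is needed.
class DecodeModelRunnerInput:
    def __init__(self, tokens, sampling_metadata):
        self.tokens = tokens
        self.sampling_metadata = sampling_metadata  # driver-only
    def get_dict_to_broadcast(self) -> Dict[str, Any]:
        return {"tokens": self.tokens}

class EmbeddingModelRunnerInput:
    def __init__(self, tokens, pooling_metadata):
        self.tokens = tokens
        self.pooling_metadata = pooling_metadata  # driver-only
    def get_dict_to_broadcast(self) -> Dict[str, Any]:
        return {"tokens": self.tokens}

def worker_broadcast(model_input: BroadcastableInput) -> Dict[str, Any]:
    # The worker doesn't need to know the concrete input type; it only
    # relies on the protocol method, so mypy checks both classes against
    # one shared signature without any inheritance.
    return model_input.get_dict_to_broadcast()
```

Because typing.Protocol is structural, the worker can accept either module's input type without the two modules importing each other.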
I'm planning to use shared memory for transport, so that we don't need to broadcast metadata at all. Only tensors will be broadcast.
Sure, we can do it in a follow-up PR.
Closing, as #5408 is a superset of this PR.
We have two broadcast_tensor_dict calls, one in the worker and one in the model runner. We can merge them into a single broadcast_tensor_dict call.
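The merge described above can be sketched by counting collective launches per step. This is an illustration only: `broadcast_tensor_dict` here is a recording stub for the real `torch.distributed`-backed call, and the payload keys are hypothetical.

```python
from typing import Any, Dict, List

# Record each "collective launch" so we can count broadcasts per step.
launches: List[Dict[str, Any]] = []

def broadcast_tensor_dict(d: Dict[str, Any], src: int = 0) -> Dict[str, Any]:
    launches.append(d)  # stand-in for the real distributed broadcast
    return d

def step_before():
    # Before the merge: the worker and the model runner each launch
    # their own broadcast, i.e. two collectives per step.
    broadcast_tensor_dict({"num_seq_groups": 4}, src=0)        # worker
    broadcast_tensor_dict({"input_tokens": [1, 2, 3]}, src=0)  # model runner

def step_after():
    # After the merge: the worker combines both payloads into one dict
    # and launches a single broadcast per step.
    merged = {"num_seq_groups": 4, "input_tokens": [1, 2, 3]}
    broadcast_tensor_dict(merged, src=0)

step_before()
n_before = len(launches)
launches.clear()
step_after()
n_after = len(launches)
```

Since each collective launch carries fixed overhead, halving the count per step is where the latency win reported in this PR comes from.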