[Draft] Add Tensor Parallel to torch_native_llama #1876
base: main
Conversation
@@ -495,8 +554,43 @@ def load_weights(self, weights: Iterable[Tuple[str, torch.Tensor]]):
         param = self.lm_head.weight
         weight_loader = getattr(param, "weight_loader", default_weight_loader)
         weight_loader(param, self.model.embed_tokens.weight)
+
+        # Re-arrange fused matrix for TP
+        tp_size = get_tensor_model_parallel_world_size()
Here, is it possible to do:
- split qkv into 3 tensors
- apply TP to each tensor
- concat the 3 tensors into a single DTensor
This way we can rely on the split/concat ops in DTensor itself instead of worrying about the implementation details?
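For illustration, a rough sketch of what that suggestion could look like with the public DTensor API (torch.distributed.tensor in recent PyTorch; torch.distributed._tensor in older releases). The names tp_size, q_size, kv_size, and fused_qkv_weight are placeholders rather than identifiers from this PR, and whether DTensor's cat handles the sharded dimension without a redistribute is part of what the thread is probing:

```python
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Shard, distribute_tensor

# Placeholder sizes; assumes torch.distributed is already initialized with tp_size ranks.
tp_size, q_size, kv_size, hidden = 2, 4096, 1024, 4096
mesh = init_device_mesh("cuda", (tp_size,))
fused_qkv_weight = torch.empty(q_size + 2 * kv_size, hidden)

# 1) Split the fused qkv weight into its three logical parts.
q_w, k_w, v_w = torch.split(fused_qkv_weight, [q_size, kv_size, kv_size], dim=0)

# 2) Apply TP to each part: shard the output dimension (dim 0) across the mesh.
q_d = distribute_tensor(q_w, mesh, [Shard(0)])
k_d = distribute_tensor(k_w, mesh, [Shard(0)])
v_d = distribute_tensor(v_w, mesh, [Shard(0)])

# 3) Concatenate the three shards back into a single fused DTensor, letting
#    DTensor's own split/cat ops handle the layout bookkeeping.
fused_dtensor = torch.cat([q_d, k_d, v_d], dim=0)
```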
Maybe? Although, at this location, we haven't applied TP yet, so there is no notion of DTensor.
Hmm, I see what you mean. We can use the DTensor API here instead of the (higher-level) TP API.
In the newer version, I added support for loading weights that are already TP-sharded. We then construct a DTensor directly from the local shard; see the ColwiseParallelSharded strategy.
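For illustration only (placeholder names, not this PR's code): constructing a DTensor directly from a locally loaded shard, which is the idea the ColwiseParallelSharded strategy packages up, might look roughly like this:

```python
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import DTensor, Shard

# Placeholders; assumes torch.distributed is already initialized with tp_size ranks.
tp_size, out_features, in_features = 2, 4096, 4096
mesh = init_device_mesh("cuda", (tp_size,))

# Each rank loads only its slice of a column-parallel weight from the checkpoint...
local_shard = torch.empty(out_features // tp_size, in_features, device="cuda")

# ...and wraps it as a DTensor without materializing the full weight anywhere.
weight = torch.nn.Parameter(DTensor.from_local(local_shard, mesh, [Shard(0)]))
```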
Just to understand: currently, step 2 is manual, right?
Not manual per se. It is already packaged and can be called with parallelize_module like the other styles, so no involvement is needed from the user or model author.
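For illustration, this is roughly how a packaged style is applied through parallelize_module; the plan keys below are placeholder module paths, not necessarily the exact plan used in this PR:

```python
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)

tp_size = 2  # placeholder TP degree; assumes torch.distributed is already initialized
mesh = init_device_mesh("cuda", (tp_size,))

# A custom style such as ColwiseParallelSharded plugs into the plan exactly
# like the built-in styles do.
plan = {
    "self_attn.qkv_proj": ColwiseParallel(),
    "self_attn.o_proj": RowwiseParallel(),
    "mlp.gate_up_proj": ColwiseParallel(),
    "mlp.down_proj": RowwiseParallel(),
}
parallelize_module(decoder_layer, mesh, plan)  # decoder_layer: the nn.Module to parallelize
```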
@@ -153,6 +153,13 @@ def __init__(
         min_per_gpu_memory = self.init_torch_distributed()
         self.sampler = Sampler()
         self.load_model()
+        if self.tp_size > 1:
+            logger.info(f"Tensor parallelism is enabled, {self.tp_size} devices will be used.")
Will this break other models? Can we do this only for torch_native_llama? For example, check hasattr(self.model, "tensor_parallel").
Good catch. Added a supports_torch_tp attribute in the model.
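For context, a minimal sketch of the kind of guard being discussed; supports_torch_tp comes from the thread above, while apply_torch_tp is a hypothetical helper name standing in for whatever applies parallelize_module:

```python
# Hypothetical guard; attribute and helper names may differ from the PR.
if self.tp_size > 1 and getattr(self.model, "supports_torch_tp", False):
    logger.info(
        f"Tensor parallelism is enabled, {self.tp_size} devices will be used."
    )
    self.apply_torch_tp()  # placeholder for the call that applies parallelize_module
```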
        param_data = param.data
        param_data = param_data.narrow(0, shard_offset, shard_size)
        assert param_data.shape == loaded_weight.shape
        param_data.copy_(loaded_weight)
        return


def shuffle_qkv_proj_weight(
This part still seems complicated. It would be good if we could have some high-level APIs to simplify it, as pointed out by @jerryzh168.
Yep, I agree. This part is now removed.
Move tp to utils
Add ColwiseParallelSharded
+        # shard_id: (shard_offset, shard_size)
+        gate_up_offsets = {}
+        current_shard_offset = 0
+        for i, output_size in enumerate(self.output_sizes):
+            gate_up_offsets[i] = (current_shard_offset, output_size)
+            current_shard_offset += output_size
         if loaded_shard_id is None:
-            shard_offsets: List[Tuple[int, int, int]] = []
-            for i, output_size in enumerate(self.output_sizes):
-                shard_offsets.append((i, current_shard_offset, output_size))
-                current_shard_offset += output_size
-            for shard_id, shard_offset, shard_size in shard_offsets:
+            for shard_id, (shard_offset, shard_size) in gate_up_offsets.items():
                 loaded_weight_shard = loaded_weight.narrow(
-                    output_dim, shard_offset, shard_size
+                    0, shard_offset, shard_size
These are style changes only.
+        # shard_id: (shard_offset, shard_size)
+        qkv_offsets = {
+            "q": (0, self.num_heads * self.head_size),
+            "k": (self.num_heads * self.head_size, self.num_kv_heads * self.head_size),
+            "v": ((self.num_heads + self.num_kv_heads) * self.head_size, self.num_kv_heads * self.head_size),
+        }
         if loaded_shard_id is None:
-            shard_offsets = [
-                # (shard_id, shard_offset, shard_size)
-                ("q", 0, self.total_num_heads * self.head_size),
-                (
-                    "k",
-                    self.total_num_heads * self.head_size,
-                    self.total_num_kv_heads * self.head_size,
-                ),
-                (
-                    "v",
-                    (self.total_num_heads + self.total_num_kv_heads) * self.head_size,
-                    self.total_num_kv_heads * self.head_size,
-                ),
-            ]
-            for shard_id, shard_offset, shard_size in shard_offsets:
+            for shard_id, (shard_offset, shard_size) in qkv_offsets.items():
These are style changes only.
        self.qkv_proj._get_shard_offset_mapping = types.MethodType(
            _get_shard_offset_mapping, self.qkv_proj
        )
        self.qkv_proj._get_shard_size_mapping = types.MethodType(
            _get_shard_size_mapping, self.qkv_proj
        )
Not used now.
LGTM. I think this is the best we can do for now, until we stop using fused qkv and rely on torch.compile for the speedup.
@@ -176,7 +165,6 @@ def __init__(
     ) -> None:
         super().__init__()
         self.hidden_size = hidden_size
-        tp_size = get_tensor_model_parallel_world_size()
         self.total_num_heads = num_heads
I see that we are already doing manual sharding here. I feel this code should move into separate TP-related code rather than being embedded in the model, if possible.
Yeah, it seems the "local" n_heads are needed for constructing the RadixAttention later:

self.attn = RadixAttention(
    self.num_heads,
    ...

I am not sure I can remove it, given that it involves a contract change with that module.
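For illustration, a hypothetical sketch (not the PR's exact code) of why the per-rank head count stays in the model; get_tensor_model_parallel_world_size comes from the surrounding diff, everything else is a placeholder:

```python
# RadixAttention is constructed with the number of heads this rank owns,
# so the model still derives the "local" head count from the TP degree.
tp_size = get_tensor_model_parallel_world_size()
self.total_num_heads = num_heads
self.num_heads = self.total_num_heads // tp_size  # local heads passed to RadixAttention
```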
Motivation

The torch_native_llama model does not have Tensor Parallel support today. This PR adds it, using torch.distributed APIs.

Modifications

- Add a .tensor_parallel() utility;
- Add ColwiseParallel and RowwiseParallel annotations to the related sub-modules.

Checklist
cc: @jerryzh168 @merrymercy @wz337