
mthreads: support multi-backend and introduce mthreads backend #458

Merged · 201 commits merged into master from mthreads/master-250225 · Feb 27, 2025

Conversation

@machuanjiang (Collaborator) commented Feb 25, 2025

PR Category

OP Test

Type of Change

Refactor

Description

Adapt the mthreads backend to the multi-backend framework.

Issue

Progress

  • Change is properly reviewed (1 reviewer required, 2 recommended).
  • Change responds to an issue.
  • Change is fully covered by a UT.

Performance

Added performance test adaptation for the MT GPU backend.

yuzhe-wu and others added 30 commits February 20, 2025 17:33
config: {BLOCK_M: 8, num_warps: 8} causes the per-thread register limit to be
exceeded when the tensor shape is 4096 * 2304, so reduce BLOCK_M to 4 to
support cumsum (see the config sketch after this commit list).
- Torch_musa does not support fp64 input type, so CPU is used as a reference
- Does not support test_accuracy_groupnorm

- Some use cases have accuracy issues in test_embedding
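
For context on the BLOCK_M reduction in the cumsum commit above, a minimal sketch of what such an autotune-config change could look like (the config structure and names here are illustrative, not the actual FlagGems cumsum configs):

import triton

# Illustrative only: on MT GPUs, {BLOCK_M: 8, num_warps: 8} exceeds the
# per-thread register budget for shapes like 4096 x 2304, so BLOCK_M is
# reduced to 4 while keeping num_warps at 8.
CUMSUM_CONFIGS = [
    triton.Config({"BLOCK_M": 4, "BLOCK_N": 2048}, num_warps=8),
    triton.Config({"BLOCK_M": 4, "BLOCK_N": 1024}, num_warps=4),
]
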
StrongSpoon and others added 16 commits February 25, 2025 08:02
* add gather_backward op

* add debug log in

* perf gather backward

* rebased with master

* scatter rewrite done.

* scatter handling internally overlapping input.

* Scatter reduce now uses atomics.

* remove fp16 from scatter reduce UT.

* sets threadblock size to 128 for scatter.

* Change atomic memory order to relaxed in scatter.

---------

Co-authored-by: awayzjj <[email protected]>
Co-authored-by: StrongSpoon <[email protected]>
1. update multi-backend code
2. fix argmin op that might fail tests under int types

Co-authored-by: mx-flaggems-user <[email protected]>
1. resolve_conj: see https://jira.mthreads.com/browse/MTAI-1530
2. fill: torch_musa does not support the case torch.fill(dtype=cpu, dtype=musa).
* add backward of conv2d

* delete useless code

* format code of tests

* modify configs for tuning

* modify autotune config

* delete test flag

* delete useless type convert

---------

Co-authored-by: Jiang Bin <[email protected]>
Signed-off-by: jiaqi.wang <jiaqi.wang@mthreads.com>
@@ -0,0 +1,70 @@
import argparse
Collaborator:

This file appears to be for internal use by mthreads. Does it need to be stored on the master branch?

Collaborator (Author):

Yes, it's just for internal usage; we will remove it.

@@ -0,0 +1,100 @@
import os
Collaborator:

ditto

Collaborator (Author):

It's just for internal usage; we will remove it.

@@ -138,6 +138,19 @@ def to_reference(inp, upcast=False):
return ref_inp


def to_reference_fp64(inp, upcast=False):
Collaborator:

Where is this to_reference_fp64 function used?

Collaborator (Author):

we will remove it.

}

HEURISTICS_CONFIG = {
vendors.NVIDIA: default_heuristics_for_num_warps,
vendors.METAX: metax_heuristics_for_num_warps,
vendors.CAMBRICON: cambricon_heuristics_for_num_warps,
vendors.MTHREADS: default_heuristics_for_num_warps,
Collaborator:

No need to add default_heuristics_for_num_warps here; the default is already default_heuristics_for_num_warps.

Collaborator (Author):

I'll remove it.
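
For reference, the reviewer's point is that vendors without special heuristics can rely on a fallback; a minimal sketch, assuming the table is consumed via a plain dict lookup (the actual FlagGems lookup site may differ):

# Sketch only: with a .get() fallback, vendors such as MTHREADS that use the
# default heuristics do not need an explicit entry in HEURISTICS_CONFIG.
heuristics = HEURISTICS_CONFIG.get(vendor, default_heuristics_for_num_warps)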

@@ -1071,7 +1071,8 @@ def __init__(self, op_desc: FunctionSchema, scalar_fn: JITFunction, config=None)

assert isinstance(scalar_fn, JITFunction)
self._scalar_fn = scalar_fn
self._scalar_fn_cache_key = scalar_fn.cache_key
# FIXME: cache_key is too long and make open file failed.
self._scalar_fn_cache_key = scalar_fn.cache_key[:33]
Collaborator:

Does this problem only occur on mthreads machines?

Collaborator (Author):

It no longer occurs; I'll revert this change.
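
As an illustrative alternative (not part of this PR), hashing the key would also give a fixed-length, openable file name without the collision risk of keeping only a 33-character prefix:

import hashlib

def short_cache_key(cache_key: str) -> str:
    # Illustrative sketch only: a fixed-length digest of the full cache key
    # avoids over-long file names while still distinguishing different keys.
    return hashlib.md5(cache_key.encode()).hexdigest()  # 32 hex characters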

Collaborator:

This is an empty file.

Collaborator (Author):

The file mode was changed to 755; I'll change it back to 644.

@@ -370,7 +370,7 @@ def sorted_quick_unique_flat(sorted_data: torch.Tensor, return_counts: bool):
next_power_global_ctas_num = triton.next_power_of_2(global_ctas_num)
ctas_num = global_ctas_num if global_ctas_num < 65536 else 2048
tiles_per_cta = triton.cdiv(num_tasks, tile_size * ctas_num)
num_warps = 8 if tiles_per_cta == 1 else 32
num_warps = 8 if tiles_per_cta == 1 else 8
Collaborator:

If this is a change specific to mthreads, please put it in the mthreads directory so it does not affect performance on other backends.

Collaborator (Author):

OK, we'll move it to the vendor ops directory.
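
A minimal sketch of the kind of vendor-scoped override the reviewer is asking for, reusing the flag_gems.device_name check from the author's snippet later in this review (the exact dispatch mechanism in the vendor ops directory may differ):

import flag_gems

def pick_num_warps(tiles_per_cta: int) -> int:
    # Sketch only: keep the generic tuning for other backends and override
    # num_warps just for the MUSA backend, rather than editing the shared path.
    if flag_gems.device_name == "musa":
        return 8                                # MT GPU: 8 warps in both cases
    return 8 if tiles_per_cta == 1 else 32      # original generic tuning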

@@ -217,11 +217,11 @@ def isin_by_search(
elif M <= 4194304: # 2 ** 22 = 1024 * 4096
_, BLOCK_M, num_warps = launch_arg(None, 1024, M, 8)
elif M <= 8388608: # 2 ** 23 = 1024 * 8192
_, BLOCK_M, num_warps = launch_arg(None, 2048, M, 16)
_, BLOCK_M, num_warps = launch_arg(None, 2048, M, 8)
Collaborator:

If this is a change specific to mthreads, please put it in the mthreads directory so it does not affect performance on other backends.

@@ -13,7 +13,8 @@
"prompt",
["How are you today?", "What is your name?", "Who are you?", "Where are you from?"],
)
@pytest.mark.parametrize("dtype", [torch.float16, torch.float32, torch.bfloat16])
# @pytest.mark.parametrize("dtype", [torch.float16, torch.float32, torch.bfloat16])
@pytest.mark.parametrize("dtype", [torch.float16, torch.float32])
Collaborator:

mthreads can choose to skip it. Don't change the original logic.

Collaborator (Author):

OK, this file seems to be an example with no effect on testing; we'll keep it as it was.
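
For reference, the skip the reviewer suggests can be expressed per parameter instead of editing the dtype list; a minimal sketch (the flag_gems.device_name check mirrors the author's snippet later in this review and is an assumption about the available API; the test name is a placeholder):

import pytest
import torch
import flag_gems

# Sketch only: keep all dtypes and skip bfloat16 just on the MUSA backend,
# rather than commenting out the original parametrization.
@pytest.mark.parametrize(
    "dtype",
    [
        torch.float16,
        torch.float32,
        pytest.param(
            torch.bfloat16,
            marks=pytest.mark.skipif(
                flag_gems.device_name == "musa",
                reason="bfloat16 not supported on musa",
            ),
        ),
    ],
)
def test_prompt_dtype(dtype):
    ...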

@@ -40,8 +42,9 @@ def resolve_conj_input_fn(shape, dtype, device):
# Sorting Operations
("topk", torch.topk, FLOAT_DTYPES, topk_input_fn),
# Complex Operations
("resolve_neg", torch.resolve_neg, [torch.cfloat], resolve_neg_input_fn),
("resolve_conj", torch.resolve_conj, [torch.cfloat], resolve_conj_input_fn),
Collaborator:

This should only be disabled when device == "musa".

Collaborator (Author):

> This should only be disabled when device == "musa".

@kiddyjinjin OK, how about changing it to:

special_operations = [
    # Sorting Operations
    ("topk", torch.topk, FLOAT_DTYPES, topk_input_fn),
    # Complex Operations
    ("resolve_neg", torch.resolve_neg, [torch.cfloat], resolve_neg_input_fn)
    if flag_gems.device_name != 'musa' else (),
    ("resolve_conj", torch.resolve_conj, [torch.cfloat], resolve_conj_input_fn)
    if flag_gems.device_name != 'musa' else (),
]

("amax", torch.amax, FLOAT_DTYPES),
("any", torch.any, FLOAT_DTYPES),
# ("any", torch.any, FLOAT_DTYPES), # mt not support, disable
Collaborator:

if device == "musa":
    forward_operations = []
else:
    forward_operations = []

Collaborator (Author):

> if device == "musa": forward_operations = [] else: forward_operations = []

How about changing it to:

forward_operations = [
    ("all", torch.all, FLOAT_DTYPES) if flag_gems.device_name != 'musa' else (),
    ("amax", torch.amax, FLOAT_DTYPES),
    ("any", torch.any, FLOAT_DTYPES) if flag_gems.device_name != 'musa' else (),
    ...
]

("floor_divide", torch.floor_divide, INT_DTYPES),
("remainder", torch.remainder, INT_DTYPES),
# ("floor_divide", torch.floor_divide, INT_DTYPES), # mt not support, disable
# ("remainder", torch.remainder, INT_DTYPES), # mt not support, disable
Collaborator:

Please distinguish the device here as well.

@@ -72,6 +72,8 @@ class BenchmarkMetrics:
tflops: Optional[float] = None
# Utilization (not implemented yet)
utilization: Optional[float] = None
# Speedup compared to base data
compared_speedup: Optional[float] = None
Collaborator:

What's the difference between 'compared_speedup' and 'speedup'?

Collaborator (Author):

It takes two perf log files as input and calculates the speedup of A against B. For example, it is used to compare the absolute latency of Triton FlagGems running on an MT GPU versus an NV GPU, to get the speedup.

Collaborator:

"I understand. Could you add a comment to indicate that this field is used in the summary_for_plot script to calculate the speedup across log files, to avoid confusion for those reviewing the code?

Collaborator (Author):

no problem
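
A minimal sketch of the comparison the author describes, assuming each perf log reduces to a latency per benchmark case (the names and log format here are illustrative, not the actual summary_for_plot implementation):

# Illustrative sketch: speedup of backend A over backend B from two perf logs,
# e.g. FlagGems latencies measured on an MT GPU versus an NVIDIA GPU.
def compared_speedup(latency_a_ms: dict, latency_b_ms: dict) -> dict:
    # A value > 1 means backend A is faster than backend B for that case.
    return {
        case: latency_b_ms[case] / latency_a_ms[case]
        for case in latency_a_ms.keys() & latency_b_ms.keys()
    }

# Hypothetical latencies in milliseconds for a single benchmark case.
speedups = compared_speedup({"softmax/4096x2304": 0.42}, {"softmax/4096x2304": 0.35})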

@@ -119,6 +119,7 @@ def test_accuracy_outer(M, N, dtype):
gems_assert_close(res_in2_grad, ref_in2_grad, dtype, reduce_dim=M)


@pytest.mark.skip("Segmentation fault")
Collaborator:

Please use skipif instead of an unconditional skip.

Collaborator (Author):

ok
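
A minimal sketch of the skipif form the reviewer asks for, reusing the flag_gems.device_name check from the author's earlier snippet (the test name and condition below are placeholders, not the actual code in this PR):

import pytest
import flag_gems

# Sketch only: skip the test only where the segmentation fault occurs,
# instead of skipping it unconditionally for every backend.
@pytest.mark.skipif(
    flag_gems.device_name == "musa", reason="Segmentation fault on the musa backend"
)
def test_affected_case():
    ...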

@machuanjiang machuanjiang force-pushed the mthreads/master-250225 branch 2 times, most recently from 2a1aa57 to f0b0476 on February 26, 2025 09:45
@machuanjiang machuanjiang force-pushed the mthreads/master-250225 branch from f0b0476 to 4a0fda6 on February 26, 2025 14:15
@machuanjiang machuanjiang changed the title from "upstream of mthreads' fork" to "mthreads: support multi-backend and introduce mthreads backend" on Feb 27, 2025
@Galaxy1458 (Collaborator) left a comment:

lgtm

@Galaxy1458 Galaxy1458 merged commit d5b888f into master Feb 27, 2025
8 of 9 checks passed
@Galaxy1458 Galaxy1458 deleted the mthreads/master-250225 branch February 27, 2025 07:13