[hipblaslt] Avoid extra pre-shuffle overheads when running benchmarks without validation #1429

solaslin · 2025-09-02T19:28:59Z

Motivation

When tuning or benchmarking swizzling kernels, the clients (both tensilelite and hipblaslt) need to do an extra memory re-layout (pre-shuffle) so that validation passes. This memory operation takes a significant amount of time.

In practice, we often skip validation when tuning and run validation only in the LibraryClient stage. When we only want the times/flops of kernels, we usually comment out the pre-shuffle code by hand to reduce the overhead. Once validation is needed, however, the pre-shuffle is unavoidable.

This PR automates that: without validation, no extra effort is spent on pre-shuffling; with validation, the pre-shuffle is always performed.

Technical Details

Only do the permute (a.k.a. data re-layout, pre-shuffle, etc.) when validation is requested (in hipblaslt-bench, -v or --verify).
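The gating described above can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: `BenchArgs`, `needs_validation`, `pre_shuffle`, and `prepare_input` are hypothetical names, and the "re-layout" here is just a buffer reversal standing in for the real swizzle. Only the three flag names are taken from the diff.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical stand-in for the client's Arguments; the three flags mirror
// arg.unit_check, arg.norm_check and arg.allclose_check from the diff.
struct BenchArgs
{
    bool unit_check     = false;
    bool norm_check     = false;
    bool allclose_check = false;
};

// True when any validation mode is requested.
inline bool needs_validation(const BenchArgs& arg)
{
    return arg.unit_check || arg.norm_check || arg.allclose_check;
}

// Hypothetical placeholder for the real re-layout: it only reverses the
// buffer so that the effect is observable in a test.
inline void pre_shuffle(std::vector<float>& host)
{
    std::reverse(host.begin(), host.end());
}

// Runs the pre-shuffle only when the kernel is a swizzling kernel AND the
// results will be validated. Returns true if the pre-shuffle actually ran.
inline bool prepare_input(std::vector<float>& host, bool do_swizzle, const BenchArgs& arg)
{
    if(do_swizzle && needs_validation(arg))
    {
        pre_shuffle(host);
        return true;
    }
    return false;
}
```

With all flags left at `false` (a timing-only run), `prepare_input` leaves the host buffer untouched; setting any one of the flags restores the re-layout.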

Test Plan

Already covered by CI since tox and gtest will do validation.

Test Result

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

solaslin · 2025-09-04T09:10:46Z

projects/hipblaslt/clients/common/include/testing_matmul.hpp

-        if(do_swizzle_a)
+        // if we are not going to do verify / validation, we don't need to do extra swizzle (pre-shuffle).
+        // In customers' real case, they will do the swizzle (pre-shuffle) in advance.
+        // In order to reduce the overhead of doing the pre-shuffle, we can choose not to do it when no --verify / -v
+        if(do_swizzle_a && (arg.unit_check || arg.norm_check || arg.allclose_check))
        {
            HipHostBuffer tmp(TiA, size_dA[i]);
            swizzle_tensor_type(tmp, hA[i], TiA, arg, num_batches[i], M[i], K[i], lda[i], false);


This is the only functional change on the hipblaslt-bench side; the rest of the diff is formatting.

The performance results could differ for problems with odd M, N, or K, since the kernel could then compute on uninitialized data.

solaslin · 2025-09-04T09:13:58Z

projects/hipblaslt/tensilelite/client/src/DataInitialization.cpp


-                if(needSwizzle)
+                // When needSwizzle, if no need to do validation, we can save the time doing data-relayout
+                if(needSwizzle && m_elementsToValidate)
                {


Same as on the hipblaslt side: without validation, we skip the extra memory re-layout operations. We still need to make sure the global memory we pass is large enough, including the auto-padding, which is done in the ctor via "getSwizzledTensorNumAllocatedElements".
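A rough sketch of that separation between sizing and re-layout follows. The names `round_up`, `padded_elements`, `BufferPlan`, and `plan_buffer` are hypothetical, and the round-up-to-tile formula is an assumption for illustration only; it is not the actual computation in "getSwizzledTensorNumAllocatedElements".

```cpp
#include <cstddef>

// Round `value` up to the next multiple of `multiple` (multiple must be > 0).
inline std::size_t round_up(std::size_t value, std::size_t multiple)
{
    return (value + multiple - 1) / multiple * multiple;
}

// Assumed padded element count for a rows x cols tensor whose swizzled layout
// works on whole tile_rows x tile_cols tiles.
inline std::size_t padded_elements(std::size_t rows,
                                   std::size_t cols,
                                   std::size_t tile_rows,
                                   std::size_t tile_cols)
{
    return round_up(rows, tile_rows) * round_up(cols, tile_cols);
}

struct BufferPlan
{
    std::size_t allocated_elements; // size of the device buffer to allocate
    bool        run_relayout;       // whether the host-side re-layout is executed
};

// The allocation size depends only on need_swizzle; the re-layout additionally
// requires validation. Skipping the re-layout never shrinks the buffer.
inline BufferPlan plan_buffer(std::size_t rows,
                              std::size_t cols,
                              std::size_t tile_rows,
                              std::size_t tile_cols,
                              bool        need_swizzle,
                              bool        validate)
{
    const std::size_t elements
        = need_swizzle ? padded_elements(rows, cols, tile_rows, tile_cols) : rows * cols;
    return BufferPlan{elements, need_swizzle && validate};
}
```

For example, under these assumptions a 33 x 17 tensor with 16 x 8 tiles is planned with 48 * 24 = 1152 elements whether or not validation is on; only `run_relayout` changes.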

solaslin · 2025-09-04T09:15:46Z

projects/hipblaslt/tensilelite/client/src/DataInitialization.cpp

                                problem.tensors()[i], MiM_N, MiK, PackK);
                            numAllocatedBytes
                                = numAllocatedElements * rocisa::GetElementSize(dataType);
+
+                            // std::cout << "DataInitialization- needSwizzle: numAllocatedElements:"
+                            //           << numAllocatedElements << std::endl;
                        }


Even without validation, "pristine.maxElements" already accounts for the padded swizzled memory, so it is safe to use in that case.

solaslin · 2025-09-04T09:33:01Z

Much of the diff in testing_matmul.hpp is formatting. I've already left self-review comments on the key parts.

jichangjichang · 2025-09-04T09:52:29Z

projects/hipblaslt/clients/common/include/testing_matmul.hpp

@@ -1923,7 +1924,10 @@ void testing_matmul_with_bias(const Arguments& arg,
            CHECK_HIP_ERROR(synchronize(hC[i], dC[i]));
        }

-        if(do_swizzle_a)


do_swizzle_a should be removed from Line#1912 as well

solaslin self-assigned this Sep 2, 2025

github-actions bot added the project: hipblaslt label Sep 2, 2025

assistant-librarian bot added the organization: ROCm label Sep 2, 2025

solaslin changed the title ~~[hipblaslt] Avoid extra overheads when running benchmarks without validation~~ [hipblaslt] Avoid extra pre-shuffle overheads when running benchmarks without validation Sep 2, 2025

solaslin added the NoCI Don't run CI label Sep 3, 2025

solaslin force-pushed the users/solaslin/benchmarking-avoids-swizzle-op-when-no-verify branch from 17814f7 to 31e0d8b Compare September 3, 2025 12:16

eidenyoshida pushed a commit that referenced this pull request Sep 3, 2025

plan: check multiple devices in rank correctly for global transpose (#…

1b809fd

…1429) [ROCm/rocFFT commit: f899388]

solaslin force-pushed the users/solaslin/benchmarking-avoids-swizzle-op-when-no-verify branch from 31e0d8b to 47e0910 Compare September 4, 2025 04:07

avoid extra works when no validation + Formatted.

da91552

solaslin force-pushed the users/solaslin/benchmarking-avoids-swizzle-op-when-no-verify branch from 47e0910 to e89cea2 Compare September 4, 2025 07:11

don't do pre-shuffle if not neccesary

7718a18

solaslin force-pushed the users/solaslin/benchmarking-avoids-swizzle-op-when-no-verify branch from e89cea2 to 7718a18 Compare September 4, 2025 09:08

solaslin marked this pull request as ready for review September 4, 2025 09:09

solaslin requested a review from a team as a code owner September 4, 2025 09:09

solaslin commented Sep 4, 2025

solaslin removed the NoCI Don't run CI label Sep 4, 2025

jichangjichang reviewed Sep 4, 2025
