Skip to content

Conversation

solaslin
Copy link
Contributor

@solaslin solaslin commented Sep 2, 2025

Motivation

When tuning/benchmarking swizzling kernels. The client sides (Both in tensilelite and hipblaslt) need to do extra memory re-layout (pre-shuffle) in order to make sure we pass then validations. But the memory OP takes quite significant time.

In our practicing, we often "do no validation" when tuning, and "do validation" in the LibraryClient stage. The point is that if we are working to get the times/flops of kernels only, we usually manually comment out the codes of "pre-shuffle" part to reduce the overhead. But once we need to do validation, the work is unavoidable.

This PR is doing this for us: if no validation, then don't put extra effort on pre-shuffling, otherwise it is a must.

Technical Details

Only do permute (a.k.a., data re-layout, pre-shuffle...etc) when we need to do validation (in hipblaslt-bench, -v or --verify).

Test Plan

Already covered by CI since tox and gtest will do validation.

Test Result

Submission Checklist

@solaslin solaslin self-assigned this Sep 2, 2025
@solaslin solaslin changed the title [hipblaslt] Avoid extra overheads when running benchmarks without validation [hipblaslt] Avoid extra pre-shuffle overheads when running benchmarks without validation Sep 2, 2025
@solaslin solaslin added the NoCI Don't run CI label Sep 3, 2025
@solaslin solaslin force-pushed the users/solaslin/benchmarking-avoids-swizzle-op-when-no-verify branch from 17814f7 to 31e0d8b Compare September 3, 2025 12:16
eidenyoshida pushed a commit that referenced this pull request Sep 3, 2025
@solaslin solaslin force-pushed the users/solaslin/benchmarking-avoids-swizzle-op-when-no-verify branch from 31e0d8b to 47e0910 Compare September 4, 2025 04:07
@solaslin solaslin force-pushed the users/solaslin/benchmarking-avoids-swizzle-op-when-no-verify branch from 47e0910 to e89cea2 Compare September 4, 2025 07:11
@solaslin solaslin force-pushed the users/solaslin/benchmarking-avoids-swizzle-op-when-no-verify branch from e89cea2 to 7718a18 Compare September 4, 2025 09:08
@solaslin solaslin marked this pull request as ready for review September 4, 2025 09:09
@solaslin solaslin requested a review from a team as a code owner September 4, 2025 09:09
Comment on lines -1926 to 1933
if(do_swizzle_a)
// if we are not going to do verify / validation, we don't need to do extra swizzle (pre-shuffle).
// In customers' real case, they will do the swizzle (pre-shuffle) in advance.
// In order to reduce the overhead of doing the pre-shuffle, we can choose not to do it when no --verify / -v
if(do_swizzle_a && (arg.unit_check || arg.norm_check || arg.allclose_check))
{
HipHostBuffer tmp(TiA, size_dA[i]);
swizzle_tensor_type(tmp, hA[i], TiA, arg, num_batches[i], M[i], K[i], lda[i], false);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the only modification in hipblaslt-bench side, others are all formatting.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The performance result could be different for odd M,N,K problem, since the kernel could compute uninitialized data.

Comment on lines 2065 to 2068

if(needSwizzle)
// When needSwizzle, if no need to do validation, we can save the time doing data-relayout
if(needSwizzle && m_elementsToValidate)
{
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as hipblaslt side, if no validation, then we don't do extra memory re-layout operations. But still need to make sure we are passing global memory with sufficient size (even with auto-padding, which is done in the ctor "getSwizzledTensorNumAllocatedElements").

Comment on lines 950 to 956
problem.tensors()[i], MiM_N, MiK, PackK);
numAllocatedBytes
= numAllocatedElements * rocisa::GetElementSize(dataType);

// std::cout << "DataInitialization- needSwizzle: numAllocatedElements:"
// << numAllocatedElements << std::endl;
}
Copy link
Contributor Author

@solaslin solaslin Sep 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

even if no validation, the "pristine.maxElements" has already considered the padded swizzled memory so it is safe to use it when no validation.

@solaslin solaslin removed the NoCI Don't run CI label Sep 4, 2025
@solaslin
Copy link
Contributor Author

solaslin commented Sep 4, 2025

Lots of diff in testing_matmul.hpp is formatting. I've already put self-review at the key part.

@@ -1923,7 +1924,10 @@ void testing_matmul_with_bias(const Arguments& arg,
CHECK_HIP_ERROR(synchronize(hC[i], dC[i]));
}

if(do_swizzle_a)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do_swizzle_a should be removed from Line#1912 as well

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants