[hipblaslt] Avoid extra pre-shuffle overheads when running benchmarks without validation #1429
base: develop
if(do_swizzle_a)
// if we are not going to do verify / validation, we don't need to do extra swizzle (pre-shuffle).
// In customers' real case, they will do the swizzle (pre-shuffle) in advance.
// In order to reduce the overhead of doing the pre-shuffle, we can choose not to do it when no --verify / -v
if(do_swizzle_a && (arg.unit_check || arg.norm_check || arg.allclose_check))
{
HipHostBuffer tmp(TiA, size_dA[i]);
swizzle_tensor_type(tmp, hA[i], TiA, arg, num_batches[i], M[i], K[i], lda[i], false);
This is the only functional change on the hipblaslt-bench side; everything else is formatting.
The performance results could differ for problems with odd M, N, K, since the kernel could then compute on uninitialized data.
if(needSwizzle)
// When needSwizzle, if no need to do validation, we can save the time doing data-relayout
if(needSwizzle && m_elementsToValidate)
{
Same as on the hipblaslt side: without validation, we skip the extra memory re-layout operations. We still need to make sure the global memory we pass is large enough, even with auto-padding (which is done in the ctor, "getSwizzledTensorNumAllocatedElements").
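To illustrate the sizing concern in this comment, here is a minimal sketch of computing a padded allocation size that is used whether or not the re-layout runs. The helper names (`round_up`, `padded_elements`) and the assumption that each dimension is padded up to a whole tile are hypothetical; the real computation lives in "getSwizzledTensorNumAllocatedElements" and may differ.

```cpp
#include <cstddef>

// Rounds value up to the next multiple of tile (tile must be > 0).
constexpr std::size_t round_up(std::size_t value, std::size_t tile)
{
    return (value + tile - 1) / tile * tile;
}

// Hypothetical sizing rule: pad both dimensions up to whole tiles so a
// swizzled kernel never reads past the end of the buffer. The buffer is
// allocated with this size even when the re-layout itself is skipped.
constexpr std::size_t padded_elements(std::size_t rows,
                                      std::size_t cols,
                                      std::size_t tile_rows,
                                      std::size_t tile_cols)
{
    return round_up(rows, tile_rows) * round_up(cols, tile_cols);
}
```

The point of the sketch is that allocation size and data re-layout are separate decisions: only the second one depends on validation.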
problem.tensors()[i], MiM_N, MiK, PackK);
numAllocatedBytes
= numAllocatedElements * rocisa::GetElementSize(dataType);

// std::cout << "DataInitialization- needSwizzle: numAllocatedElements:"
// << numAllocatedElements << std::endl;
}
Even without validation, "pristine.maxElements" already accounts for the padded swizzled memory, so it is safe to use.
Most of the diff in testing_matmul.hpp is formatting. I've already left self-review comments on the key parts.
@@ -1923,7 +1924,10 @@ void testing_matmul_with_bias(const Arguments& arg,
CHECK_HIP_ERROR(synchronize(hC[i], dC[i]));
}

if(do_swizzle_a)
do_swizzle_a should be removed from Line#1912 as well
Motivation
When tuning or benchmarking swizzling kernels, the client sides (both in tensilelite and hipblaslt) need to do an extra memory re-layout (pre-shuffle) to make sure validation passes. But this memory operation takes quite a significant amount of time.
In practice, we often skip validation when tuning and run validation in the LibraryClient stage. When we only want the times/flops of the kernels, we usually comment out the pre-shuffle code by hand to reduce the overhead. But once we need validation, the work is unavoidable.
This PR automates that: without validation, no extra effort is spent on pre-shuffling; with validation, it is required.
Technical Details
Only do the permute (a.k.a. data re-layout, pre-shuffle, etc.) when validation is needed (in hipblaslt-bench, -v or --verify).
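The gating can be sketched in isolation as follows. This is a simplified illustration, not the code in this PR: `BenchArgs`, `needs_validation`, `relayout`, and `prepare_a` are hypothetical names, and a plain transpose stands in for the real swizzle. Only the three flag names (`unit_check`, `norm_check`, `allclose_check`) come from the diff.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the benchmark arguments; the field names mirror
// the flags tested in the diff.
struct BenchArgs
{
    bool unit_check     = false;
    bool norm_check     = false;
    bool allclose_check = false;
};

// True when any validation mode was requested.
inline bool needs_validation(const BenchArgs& arg)
{
    return arg.unit_check || arg.norm_check || arg.allclose_check;
}

// Placeholder re-layout: a row-major to column-major transpose stands in
// for the real swizzle, which is more involved.
inline std::vector<float> relayout(const std::vector<float>& src,
                                   std::size_t               rows,
                                   std::size_t               cols)
{
    assert(src.size() == rows * cols);
    std::vector<float> dst(src.size());
    for(std::size_t r = 0; r < rows; ++r)
        for(std::size_t c = 0; c < cols; ++c)
            dst[c * rows + r] = src[r * cols + c];
    return dst;
}

// Re-lays out `a` only when swizzling is enabled AND results will be
// validated. Returns true if the re-layout was performed.
inline bool prepare_a(std::vector<float>& a,
                      std::size_t         rows,
                      std::size_t         cols,
                      bool                do_swizzle_a,
                      const BenchArgs&    arg)
{
    if(do_swizzle_a && needs_validation(arg))
    {
        a = relayout(a, rows, cols);
        return true;
    }
    return false; // timing-only run: skip the host-side overhead
}
```

In a timing-only run the host buffer is left untouched, so the kernel still receives a buffer of the right size but the values are not in swizzled order; that is acceptable only because nothing compares the output.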
Test Plan
Already covered by CI, since tox and gtest run with validation.
Test Result
Submission Checklist