Sender Side Cache Bypass #1860
Conversation
src/device/op128.h
Outdated
__builtin_nontemporal_store(value.u64[0], (uint64_t*)addr); \
__builtin_nontemporal_store(value.u64[1], (uint64_t*)addr+1); \
/*store_bytepack16_##space(addr, value);*/ \
/* sc0/sc1 widen the store's scope so it bypasses the GPU's L1/L2 caches. */ \
asm ("global_store_dwordx4 %0, %1 off sc0 sc1" :: "v"((addr)), "v"((value.u64_vec2))); \
I think we need the "nt" flag as well to bypass the MALL cache.
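For illustration, the store above with the nt modifier appended (assuming gfx94x, where nt marks the access non-temporal so it also bypasses the MALL):

asm ("global_store_dwordx4 %0, %1 off sc0 sc1 nt" :: "v"((addr)), "v"((value.u64_vec2))); \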
Setting the nt bit appears to add roughly 1% performance overhead. For multinode, the overhead is significantly larger.
This is fine as a short-term stopgap solution. The PR below is more complete: it is supposed to support all gfx targets and include unit tests. However, the unit tests were not implemented correctly.
#1476
Bunch of failures on cpx.
Yup, need to debug.
Details
Do not mention proprietary info or link to internal work items in this PR.
Work item: "Internal", or link to GitHub issue (if applicable).
What were the changes?
Bypass hardware caches on stores, and possibly on loads as well.
Why were the changes made?
Seems to be about a 4% average uplift on certain GPUs for single-node bfloat16 allreduce. It could be more or less depending on the GPU type.
Cache invalidations and writebacks can be costly; let's make this a nonissue. I would like to eventually get us off extended-scope fine-grain memory and just use cache-bypassing stores and loads, so that the mtype of an allocation shouldn't matter. This is important for user-registered buffers, where we can't count on the buffers being allocated in extended-scope fine-grain memory. In that case, on the sender, we have to execute a system-scope release fence. Note that this is currently incorrect in the code in the general case, because we use a device-scope acquire-release fence; that is only valid when the protocol buffer is marked as extended-scope fine-grain memory.
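A minimal sketch of the sender-side publish pattern described above, assuming HIP's __hip_atomic_store builtin; the function and variable names here are hypothetical, not from this PR:

#include <hip/hip_runtime.h>
#include <cstdint>

// Illustrative only: publish a payload through a user-registered buffer.
__device__ void publishToPeer(uint64_t* data, uint64_t v, uint32_t* flag) {
  // Cache-bypassing payload store.
  __builtin_nontemporal_store(v, data);
  // System-scope release: orders the payload before the flag for all agents
  // (peer GPUs, CPU, NIC). A device-scope fence is only sufficient when the
  // buffer is extended-scope fine-grain memory.
  __hip_atomic_store(flag, 1u, __ATOMIC_RELEASE, __HIP_MEMORY_SCOPE_SYSTEM);
}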
How was the outcome achieved?
Certain AMD GPUs support hardware cache-bypassing loads and stores. I'm exploiting that here.
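As a rough sketch of the Clang builtins involved (names are illustrative; on supported AMD GPUs the compiler lowers these non-temporal accesses to cache-bypassing loads and stores):

#include <hip/hip_runtime.h>
#include <cstdint>

// Illustrative only: copy 16 bytes without going through the caches.
__device__ void copy16Bypass(const uint64_t* src, uint64_t* dst) {
  uint64_t lo = __builtin_nontemporal_load(src);
  uint64_t hi = __builtin_nontemporal_load(src + 1);
  __builtin_nontemporal_store(lo, dst);
  __builtin_nontemporal_store(hi, dst + 1);
}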
Additional Documentation:
Seeing a nice performance improvement here across a range of message sizes.
Approval Checklist
Do not approve until these items are satisfied.