Conversation

@alex-breslow-amd (Contributor) commented Aug 14, 2025

Details

Do not mention proprietary info or link to internal work items in this PR.

Work item: "Internal", or link to GitHub issue (if applicable).

What were the changes?
Bypass hardware caches on stores and maybe loads.

Why were the changes made?
This seems to give roughly a 4% average uplift on certain GPUs for single-node bfloat16 allreduce; the exact gain varies by GPU type.

Cache invalidations and writebacks can be costly; bypassing the caches makes this a non-issue. Longer term, I would like to move us off extended-scope fine-grained memory entirely and rely on cache-bypassing stores and loads, so that the mtype of an allocation no longer matters. This is important for user-registered buffers, where we cannot count on the buffers being allocated in extended-scope fine-grained memory. In that case, the sender has to execute a system-scope release fence. Note that the current code is incorrect in the general case because it uses a device-scope acquire-release fence, which is only valid when the protocol buffer is marked as extended-scope fine-grained memory.
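
To make the fence-scope point concrete, here is a minimal sketch (not this PR's code; the kernel and buffer names are hypothetical) contrasting a device-scope fence with the system-scope release a sender needs when the buffer is user-registered rather than extended-scope fine-grained memory:

#include <hip/hip_runtime.h>
#include <cstdint>

// Hypothetical producer kernel: writes a payload, then publishes a flag that a
// peer GPU (or the host) polls. The release fence orders the payload writes
// before the flag write.
__global__ void producer_kernel(uint64_t* payload, volatile uint32_t* flag) {
  payload[threadIdx.x] = 42ull;  // payload writes the consumer must observe

  __syncthreads();
  if (threadIdx.x == 0) {
    // __threadfence() orders only at device (agent) scope, which is enough
    // only when the buffer is extended-scope fine-grained memory.
    // __threadfence_system() orders at system scope, which is what a
    // user-registered buffer read by another agent requires.
    __threadfence_system();
    *flag = 1;  // publish: the consumer spins on this flag
  }
}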

How was the outcome achieved?
Certain AMD GPUs support hardware cache-bypassing (non-temporal) loads and stores; this change exploits that.
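
Roughly, the idea boils down to the compiler's non-temporal builtins. The following is an illustrative sketch only; the helper names st_global_nt_b128 / ld_global_nt_b128 and the BytePack16 union are made up here and are not this PR's actual macros:

#include <hip/hip_runtime.h>
#include <cstdint>

// 16-byte value viewed as two 64-bit words, mirroring the style of the diff.
union BytePack16 {
  uint64_t u64[2];
};

// Store 16 bytes with a non-temporal hint so the backend can pick
// cache-bypassing store instructions instead of normal cached stores.
__device__ inline void st_global_nt_b128(void* addr, BytePack16 value) {
  __builtin_nontemporal_store(value.u64[0], (uint64_t*)addr);
  __builtin_nontemporal_store(value.u64[1], (uint64_t*)addr + 1);
}

// The matching load side uses __builtin_nontemporal_load.
__device__ inline BytePack16 ld_global_nt_b128(void* addr) {
  BytePack16 v;
  v.u64[0] = __builtin_nontemporal_load((uint64_t*)addr);
  v.u64[1] = __builtin_nontemporal_load((uint64_t*)addr + 1);
  return v;
}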

Additional Documentation:
A noticeable performance improvement is seen across a range of message sizes.

Approval Checklist

Do not approve until these items are satisfied.

  • Verify the CHANGELOG has been updated, if
    • there are any NCCL API version changes,
    • any changes impact library users, and/or
    • any changes impact any other ROCm library.

__builtin_nontemporal_store(value.u64[0], (uint64_t*)addr); \
__builtin_nontemporal_store(value.u64[1], (uint64_t*)addr+1); \
/*store_bytepack16_##space(addr, value);*/ \
asm ("global_store_dwordx4 %0, %1 off sc0 sc1" :: "v"((addr)), "v"((value.u64_vec2))); \
A collaborator commented on the diff hunk above:

I think we need the "nt" flag as well to bypass the MALL cache.

@alex-breslow-amd (Contributor, Author) replied Aug 15, 2025:

There appears to be a ~1% performance overhead from setting the nt bit. For multi-node, the overhead is significantly larger.
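
For reference, a sketch of the variant being discussed (not what this PR does as written): the same 128-bit store with the nt bit added on top of sc0/sc1, on targets whose ISA has that modifier. The operand names mirror the diff above; the vector typedef is illustrative.

#include <hip/hip_runtime.h>
#include <cstdint>

// Illustrative 2 x 64-bit vector type standing in for the diff's u64_vec2 member.
typedef uint64_t uint64x2_t __attribute__((ext_vector_type(2)));

// Sketch only: 128-bit store with sc0/sc1 plus the nt (non-temporal) bit,
// which the comment above suggests is needed to also bypass the MALL.
__device__ inline void st_global_b128_sc01_nt(void* addr, uint64x2_t value) {
  asm volatile("global_store_dwordx4 %0, %1, off sc0 sc1 nt"
               :: "v"(addr), "v"(value) : "memory");
}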

@wenkaidu (Collaborator) left a comment:

This is fine as a short-term stopgap solution. The PR below is more complete: it is supposed to support all gfx targets and come with unit tests, but the unit test was not implemented correctly.
#1476

@thananon (Contributor) commented:

Bunch of failures on cpx.

@alex-breslow-amd (Contributor, Author) replied:

Re: "Bunch of failures on cpx."

Yup, need to debug.
