Conversation

@alex-breslow-amd (Contributor) commented Aug 14, 2025

Details

Do not mention proprietary info or link to internal work items in this PR.

Work item: "Internal", or link to GitHub issue (if applicable).

What were the changes?
Bypass hardware caches on stores and maybe loads.

Why were the changes made?
This seems to give roughly a 4% average uplift on certain GPUs for single-node bfloat16 allreduce; the exact gain varies by GPU type.

Cache invalidations and writebacks can be costly; bypassing the caches makes this a non-issue. Longer term, I would like to move us off extended-scope fine-grained memory entirely and rely on cache-bypassing stores and loads, so that the mtype of an allocation no longer matters. This is important for user-registered buffers, where we cannot count on the buffers being allocated in extended-scope fine-grained memory. In that case, the sender has to execute a system-scope release fence. Note that the current code is incorrect in the general case because it uses a device-scope acquire-release fence, which is only valid when the protocol buffer is marked as extended-scope fine-grained memory.
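
To make the fence-scope point concrete, here is a minimal sketch (not this PR's code; the kernel and buffer names are hypothetical) contrasting a device-scope fence with the system-scope release a sender needs when the buffer is user-registered rather than extended-scope fine-grained memory:

#include <hip/hip_runtime.h>
#include <cstdint>

// Hypothetical producer kernel: writes a payload, then publishes a flag that a
// peer GPU (or the host) polls. The release fence orders the payload writes
// before the flag write.
__global__ void producer_kernel(uint64_t* payload, volatile uint32_t* flag) {
  payload[threadIdx.x] = 42ull;  // payload writes the consumer must observe

  __syncthreads();
  if (threadIdx.x == 0) {
    // __threadfence() orders only at device (agent) scope, which is enough
    // only when the buffer is extended-scope fine-grained memory.
    // __threadfence_system() orders at system scope, which is what a
    // user-registered buffer read by another agent requires.
    __threadfence_system();
    *flag = 1;  // publish: the consumer spins on this flag
  }
}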

How was the outcome achieved?
Certain AMD GPUs support hardware cache-bypassing (non-temporal) loads and stores; this change exploits that.
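
Roughly, the idea boils down to the compiler's non-temporal builtins. The following is an illustrative sketch only; the helper names st_global_nt_b128 / ld_global_nt_b128 and the BytePack16 union are made up here and are not this PR's actual macros:

#include <hip/hip_runtime.h>
#include <cstdint>

// 16-byte value viewed as two 64-bit words, mirroring the style of the diff.
union BytePack16 {
  uint64_t u64[2];
};

// Store 16 bytes with a non-temporal hint so the backend can pick
// cache-bypassing store instructions instead of normal cached stores.
__device__ inline void st_global_nt_b128(void* addr, BytePack16 value) {
  __builtin_nontemporal_store(value.u64[0], (uint64_t*)addr);
  __builtin_nontemporal_store(value.u64[1], (uint64_t*)addr + 1);
}

// The matching load side uses __builtin_nontemporal_load.
__device__ inline BytePack16 ld_global_nt_b128(void* addr) {
  BytePack16 v;
  v.u64[0] = __builtin_nontemporal_load((uint64_t*)addr);
  v.u64[1] = __builtin_nontemporal_load((uint64_t*)addr + 1);
  return v;
}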

Additional Documentation:
A noticeable performance improvement is seen across a range of message sizes.

Approval Checklist

Do not approve until these items are satisfied.

  • Verify the CHANGELOG has been updated, if
    • there are any NCCL API version changes,
    • any changes impact library users, and/or
    • any changes impact any other ROCm library.

__builtin_nontemporal_store(value.u64[0], (uint64_t*)addr); \
__builtin_nontemporal_store(value.u64[1], (uint64_t*)addr+1); \
/*store_bytepack16_##space(addr, value);*/ \
asm ("global_store_dwordx4 %0, %1 off sc0 sc1" :: "v"((addr)), "v"((value.u64_vec2))); \
A collaborator commented on the diff hunk above:

I think we need the "nt" flag as well to bypass the MALL cache.

@alex-breslow-amd (Contributor, Author) replied Aug 15, 2025:

There appears to be a ~1% performance overhead from setting the nt bit. For multi-node, the overhead is significantly larger.
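
For reference, a sketch of the variant being discussed (not what this PR does as written): the same 128-bit store with the nt bit added on top of sc0/sc1, on targets whose ISA has that modifier. The operand names mirror the diff above; the vector typedef is illustrative.

#include <hip/hip_runtime.h>
#include <cstdint>

// Illustrative 2 x 64-bit vector type standing in for the diff's u64_vec2 member.
typedef uint64_t uint64x2_t __attribute__((ext_vector_type(2)));

// Sketch only: 128-bit store with sc0/sc1 plus the nt (non-temporal) bit,
// which the comment above suggests is needed to also bypass the MALL.
__device__ inline void st_global_b128_sc01_nt(void* addr, uint64x2_t value) {
  asm volatile("global_store_dwordx4 %0, %1, off sc0 sc1 nt"
               :: "v"(addr), "v"(value) : "memory");
}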

@wenkaidu (Collaborator) left a comment:

This is fine as a short-term stopgap solution. The PR below is more complete: it is supposed to support all gfx targets and come with unit tests, but the unit test was not implemented correctly.
#1476

@thananon (Contributor) commented:

Bunch of failures on cpx.

@alex-breslow-amd (Contributor, Author) replied:

Re: "Bunch of failures on cpx."

Yup, need to debug.
