Fix atomic_ref narrow-type memcheck false positive #6442
Closed
Summary
- Adds `__atomic_ref_small_tag` and `reference_small.h` so `cuda::atomic_ref` on sub-4-byte types performs byte-granular locking instead of widened RMWs (a dispatch sketch follows this list)
- Adds an `atomic_ref<int8_t>` regression test covering host and device access patterns (sanitizer-friendly)
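For orientation, a minimal sketch of the dispatch idea: a trait routes sub-4-byte types to the small-storage tag. Only `__atomic_ref_small_tag` is named by this PR; the trait and the second tag below are hypothetical placeholders, not the contents of `reference_small.h`.

```cpp
// Illustrative sketch only; the PR's actual trait and tag names may differ.
#include <cstdint>
#include <type_traits>

struct __atomic_ref_small_tag {};  // tag named in this PR
struct __atomic_ref_word_tag {};   // hypothetical tag for the existing >=4-byte path

// Hypothetical selector: sub-4-byte types take the byte-granular-lock path.
template <class _Tp>
using __atomic_ref_storage_tag_t =
    std::conditional_t<(sizeof(_Tp) < 4), __atomic_ref_small_tag, __atomic_ref_word_tag>;

static_assert(std::is_same_v<__atomic_ref_storage_tag_t<std::int8_t>, __atomic_ref_small_tag>);
static_assert(std::is_same_v<__atomic_ref_storage_tag_t<std::int32_t>, __atomic_ref_word_tag>);
```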
Motivation

#6430 reports that `cuda::atomic_ref` on narrow types trips compute-sanitizer memcheck because the implementation promotes byte writes to 32-bit RMWs. The sanitizer flags that widened access as an invalid global read, which blocks users (e.g., libcudf) from running their pipelines under memcheck.
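For context, the kind of word-widened emulation that memcheck objects to looks roughly like this. It is a generic illustration, not the library's actual code; `store_byte_widened` and its internals are made up for exposition.

```cpp
// Generic illustration of a word-widened byte store (not the library's code).
#include <cstdint>

__device__ void store_byte_widened(std::int8_t* addr, std::int8_t value) {
  // Locate the 4-byte word containing the byte and the byte's position in it.
  auto base      = reinterpret_cast<std::uintptr_t>(addr);
  auto* word_ptr = reinterpret_cast<unsigned int*>(base & ~std::uintptr_t{3});
  unsigned shift = static_cast<unsigned>(base & 3u) * 8u;
  unsigned mask  = 0xFFu << shift;

  // The 32-bit read/CAS can touch bytes outside a 1-byte allocation,
  // which is exactly what memcheck reports as an invalid global read.
  unsigned old_word = *word_ptr;
  unsigned assumed;
  do {
    assumed = old_word;
    unsigned new_word =
        (assumed & ~mask) |
        (static_cast<unsigned>(static_cast<std::uint8_t>(value)) << shift);
    old_word = atomicCAS(word_ptr, assumed, new_word);
  } while (assumed != old_word);
}
```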
Explanation

The patch introduces a new storage tag for narrow `atomic_ref` instances and backs it with a small device-side lock table. Each operation acquires a byte-granular lock, performs the necessary load/store/update, and releases the lock. Host execution still relies on the existing libatomic wrappers. This approach keeps behavior identical while preventing memcheck from seeing out-of-bounds accesses.
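A condensed sketch of the device-side path described above, assuming the lock table is a hashed array of word-sized spinlocks. The table size, hashing, and names are assumptions (the real code lives in `reference_small.h`), and warp-level lock contention is not handled here.

```cpp
// Illustrative sketch of the byte-granular locking scheme; names are hypothetical.
#include <cstdint>

__device__ unsigned int __byte_lock_table[1024];  // hypothetical hashed lock table

__device__ unsigned int* __lock_for(const void* ptr) {
  return &__byte_lock_table[reinterpret_cast<std::uintptr_t>(ptr) % 1024];
}

__device__ std::int8_t __locked_fetch_add(std::int8_t* ptr, std::int8_t arg) {
  unsigned int* lock = __lock_for(ptr);
  while (atomicCAS(lock, 0u, 1u) != 0u) {}     // acquire the byte's lock
  __threadfence();                             // order the critical section
  std::int8_t old = *ptr;                      // only 1-byte loads/stores touch
  *ptr = static_cast<std::int8_t>(old + arg);  // the referenced object, so no
  __threadfence();                             // widened access for memcheck
  atomicExch(lock, 0u);                        // release
  return old;
}
```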
Rationale

- Confines the change to the `atomic_ref` storage layer; existing owning atomics and the dispatch macros stay untouched.
- Exercises `atomic_ref<int8_t>` on both host and device, proving the narrow path works end-to-end while staying fast under the sanitizer (a rough sketch of the test follows this list).
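The rough shape of that regression test, mixing device- and host-side increments on a 1-byte managed allocation. This is a paraphrase, not the actual contents of `8b_integral_ref.pass.cpp`, and error checking is omitted for brevity.

```cpp
// Sketch of an atomic_ref<int8_t> host/device regression test (illustrative only).
#include <cuda/atomic>
#include <cuda_runtime.h>
#include <cassert>
#include <cstdint>

__global__ void device_increments(std::int8_t* p) {
  cuda::atomic_ref<std::int8_t, cuda::thread_scope_device> ref(*p);
  ref.fetch_add(std::int8_t{1});
}

int main() {
  std::int8_t* value = nullptr;
  cudaMallocManaged(&value, sizeof(std::int8_t));  // a 1-byte allocation, the
  *value = 0;                                      // case memcheck used to flag

  device_increments<<<1, 32>>>(value);             // 32 device-side increments
  cudaDeviceSynchronize();

  cuda::atomic_ref<std::int8_t, cuda::thread_scope_system> ref(*value);
  ref.fetch_add(std::int8_t{1});                   // one host-side increment

  assert(*value == 33);
  cudaFree(value);
  return 0;
}
```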
Testing

- `nvcc -std=c++20 -x cu libcudacxx/test/libcudacxx/std/atomics/atomics.types.generic/integral/8b_integral_ref.pass.cpp -Ilibcudacxx/include -Ilibcudacxx/test/support -o /tmp/atomic_ref_small_test`
- `compute-sanitizer --tool memcheck /tmp/atomic_ref_small_test`

@miscco @griwes