-
Notifications
You must be signed in to change notification settings - Fork 432
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RFC: UCS/ARCH: Avoid direct usage of REP MOVSB in favor of memcpy #10356
base: master
Are you sure you want to change the base?
RFC: UCS/ARCH: Avoid direct usage of REP MOVSB in favor of memcpy #10356
Conversation
Glibc has two tunable parameters that control the use of 'REP MOVSB': a) __x86_rep_movsb_threshold b) __x86_rep_movsb_stop_threshold Ref: https://elixir.bootlin.com/glibc/glibc-2.34/source/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S Users can set the desired value of __x86_rep_movsb_threshold through the tunable glibc.cpu.x86_rep_movsb_threshold and can decide which length range uses ERMS and which range uses vectorized routines. Ref: https://www.gnu.org/software/libc/manual/html_node/Tunables.html These tunable parameters are set based on the x86 CPU type and available hardware features. Instead of replicating this infrastructure in UCX, we can utilize the 'REP MOVSB' mechanism provided by memcpy().
@arun-chandran-edarath setting the global configs would affect all invocations of memcpy, but we only want the affect those done in specific contexts. It will even affect memcpy calls done in user application outside of UCX. |
I was not suggesting to use global configs of glibc to control the memcpy() behaviour while running ucx. My point is choice of 'REP MOVSB' instruction to copy data depends on lot of hardware factors and glibc does a good job in deciding when to use it. We are using a miniature version of that decision logic in ucx by using only two variables builtin_memcpy_min and builtin_memcpy_max, which may not give the right result on every x86 hardware. |
@arun-chandran-edarath we observed that intoroducing the "rep movsb" memory copy does improve the performance in some cases. was the "__x86_rep_movsb_threshold" availbale in older glibc versions as well (rh7/rh8)? |
I think rh7 glibc version is 2.17 and rh8 version is 2.28. The tunable "__x86_rep_movsb_threshold" came much later in version 2.32 Checking the version historyarun@arun-ubuntu-2310:~/glibc$ git blame sysdeps/x86/dl-tunables.list 46b5e98ef6f (Noah Goldstein 2024-05-24 12:38:51 -0500 33) x86_memset_non_temporal_threshold { arun@arun-ubuntu-2310:~/glibc$ git show 3f4b61a0b8d
|
So i think that we need to keep ucx internal "rep movsb" code, maybe disabale it depending to glibc version or the presence of x86_rep_stosb_threshold symbol |
Even though rh8 is using glibc version 2.28 it supports tuning knob 'glibc.cpu.x86_rep_movsb_threshold', that means they have backported the fix to 2.28. |
rh7 is in it's 'Extended life cycle support'
Do we need to care for such older os-releases? |
Please look at the bug below for a case when the distance between src and dst degrades the 'rep movsb' performance. https://sourceware.org/bugzilla/show_bug.cgi?id=27130 Commit: https://sourceware.org/git/?p=glibc.git;a=commit;h=3ec5d83d2a237d39e7fd6ef7a0bc8ac4c171a4a5 Description: When copying with "rep movsb", if the distance between source and |
Glibc has the below logic depending on the cpu features to set default values for 'rep_movsb_threshold' and decide the window of usage for 'rep movsb' https://codebrowser.dev/glibc/glibc/sysdeps/x86/dl-cacheinfo.h.html#963 @yosefe what do you think? keep the current code as it is, and disable compiling it by default with '#undef ENABLE_BUILTIN_MEMCPY' while configuring ucx? Note: The current value set for builtin_memcpy_min, in UCX for Intel processors is 1KB, it differs from what glibc uses as min value, it is 2112 for glibc for processors with FSRM feature. |
I think that in order to disable by default or remove the existing code we need to ensure it does not make performance worse. Today it is enabled on both Intel and AMD CPUs, so need to check on both I guess. Also, the current code enables inlining the memcpy function in the calling code (though AFAIU some compilers can do it anyway with builtin memcpy) |
Currently, it is disabled by default (at runtime) on AMD CPUs in UCX, and I am not sure if any users override this via the UCX command line. Perhaps we could consider a compile-time check and disabling it for AMD CPUs? From src/ucs/arch/cpu.c:
So I think we need to check performance degradation only on Intel CPUs, if we want to disable/remove it. Generally, using 'rep movsb' is not recommended on Zen3+ CPUs from AMD due to performance issues observed when 'src' and 'dst' are not aligned (https://sourceware.org/bugzilla/show_bug.cgi?id=30994). However, we see slight performance improvements when | src -dst | = 0 or multiple of 32 |
I don't think we can can disable it by default at compile time (at least, not yet), but maybe add configure flag to allow custom builds to disable it?
Right
So do we need to make further adjustments in the code to make it optimal for AMD CPU? |
I will explore this further
It helps only the cases where |src-dst| & 0x1f == 0, we have seen the performance improvements in raw memcpy benchmarks. To incorporate this to ucx we need to add additional branches, other unaligned cases have to go through the regular memcpy() path. So we need a very detailed investigation to check the usefulness of it within ucx framework. Then another problem will arise, whether to use nt_buffer_transfer or 'rep movsb' for a given 'len' ? The 'rep movsb' code path will win the selection as it uses 'len' for comparison and a transfer of length 'total_len' can get divided into length 'len' that can fall between 'builtin_memcpy_min' & builtin_memcpy_max. The code is given below. |
Glibc has two tunable parameters that control the use of 'REP MOVSB':
a) __x86_rep_movsb_threshold
b) __x86_rep_movsb_stop_threshold
Ref: https://elixir.bootlin.com/glibc/glibc-2.34/source/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
Users can set the desired value of __x86_rep_movsb_threshold through the tunable glibc.cpu.x86_rep_movsb_threshold and can decide which length range uses ERMS and which range uses
vectorized routines.
Ref: https://www.gnu.org/software/libc/manual/html_node/Tunables.html
These tunable parameters are set based on the x86 CPU type and available hardware features. Instead of replicating this infrastructure in UCX, we can utilize the 'REP MOVSB' mechanism via memcpy().