-
Notifications
You must be signed in to change notification settings - Fork 488
UCT/MM: Replace CAS with FAA #10705
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
UCT/MM: Replace CAS with FAA #10705
Conversation
56e4612
to
4c5fce2
Compare
Signed-off-by: Arun Chandran <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in general it's a nice idea, maybe we should still check FIFO availability before doing FAA (to avoid reserving a slot and not using it immediately), and only if we did FAA and another sender increased the head, and only now we're out of resources, we use the reserved_head stragegy.
also need to pay careful attention to uct_mm_ep_flush, and that the resource checking logic in uct_mm_ep_process_pending and am_send_common is the same.
uct_mm_ep_signal_remote(ep); | ||
/* Try to minimize number of ranks that do _fetch_and_and_ in | ||
* the case of heavy contention */ | ||
if ((ep->fifo_ctl->head - head) == 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this check really needed given that if (prev_head & UCT_MM_IFACE_FIFO_HEAD_EVENT_ARMED)
already makes sure only one rank would call uct_mm_ep_signal_remote?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this check really needed given that if (prev_head & UCT_MM_IFACE_FIFO_HEAD_EVENT_ARMED) already makes sure only one rank would call uct_mm_ep_signal_remote?
Let's assume multiple writers are simultaneously trying to communicate with a reader whose FIFO is already 'ARMED'.
In this case, as we are using FAA instead of CAS, all writers will see the 'ARM' bit.
If we don't have the '((ep->fifo_ctl->head - head) == 1)' check, all writers would perform 'ucs_atomic_fand64(&ep->fifo_ctl->head, ~UCT_MM_IFACE_FIFO_HEAD_EVENT_ARMED)', increasing unnecessary atomic operations.
This check helps reduce the number of writers doing atomic_fetch_and_and.
We could avoid it if signaling is infrequent among participants.
Can you comment on how often signaling typically occurs among workers (ranks)?
/* check if resources became available */ | ||
if (uct_mm_ep_has_tx_resources(ep)) { | ||
/* check if we can use the already reserved slot in FIFO */ | ||
if (UCT_MM_EP_IS_ABLE_TO_SEND(ep->reserved_head, ep->cached_tail, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe uct_mm_ep_has_tx_resources should be updated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will try this way, seems better.
@yosefe Thank you for the review.
The drawback of checking FIFO availability before FAA is that it would bring the cache line containing 'ep->fifo_ctl->head' (likely from another core) to the current core, wasting cycles compared to directly doing FAA. This also doesn't guarantee that 'head' won't go beyond the FIFO size when multiple writers are trying for the slot simultaneously. To avoid the extra read, I chose direct FAA. This also prevents starvation, as slots are now distributed in a first-come, first-served manner, unlike the CAS method where late writers could steal slots.
Good point, I did not change the resource checking logic in uct_mm_ep_flush(). I will update it. Could you clarify in which scenarios uct_mm_ep_flush() is used? |
RFC: Is there a better alternative for CAS(Compare and exchange)?
This proof of concept is motivated by the contention and
performance issues observed with CAS in an intra-node with a larger number
of ranks.
My intention is to:
a) Highlight the challenges with CAS in high-contention; this usually happens
when multiple ranks try to send a buffer to a single rank in a FULL_MPI
workload with 256 ranks or more.
b) Propose FAA (Fetch and add) as a potentially better alternative.
c) Gather feedback, concerns, or suggestions from UCX-experts.
Is there any other FIFO design that can solve this issue in
a more optimized and memory-efficient way?
References:
https://medium.com/@pravvich/cas-and-faa-through-the-eyes-of-a-java-developer-8a028f213624
https://drops.dagstuhl.de/storage/00lipics/lipics-vol146-disc2019/LIPIcs.DISC.2019.28/LIPIcs.DISC.2019.28.pdf
I can envision the following issues with the POC.
a) The reserved_head is saved globally in ep->reserved_head and used
later, if the writer cannot immediately write to the reader's FIFO.
This could fail for UCS_THREAD_MODE_MULTI, right?
b) A writer holding a reserved_head in a reader's FIFO might waste
polling cycles of a reader if the reserved slot is not
used before the reader reaches that particular slot to poll for new data.
We can minimize this problem by increasing the FIFO size (UCX_SYSV_FIFO_SIZE)
and more aggressively advancing the 'tail' from the reader's end (UCX_SYSV_FIFO_RELEASE_FACTOR)
--Arun